ABSTRACT


This book covers technologies, applications, tools, languages, procedures, advantages, and disad- 
vantages of reconfigurable supercomputing using Field Programmable Gate Arrays (FPGAs). The 
target audience is the community of users of High Performance Computers (HPC) who may benefit 
from porting their applications into a reconfigurable environment. As such, this book is intended to 
guide the HPC user through the many algorithmic considerations, hardware alternatives, usability 
issues, programming languages, and design tools that need to be understood before embarking on 
the creation of reconfigurable parallel codes. We hope to show that FPGA acceleration, based on
the exploitation of data parallelism, pipelining, and concurrency, remains promising in view of
the diminishing improvements in traditional processor and system design.

Preface


This book covers technologies, applications, tools, languages, procedures, advantages, and dis- 
advantages of reconfigurable supercomputing using Field Programmable Gate Arrays (FPGAs). The 
target audience is the community of users of High Performance Computers (HPC) who may benefit 
from porting their applications into a reconfigurable environment. As such, this book is intended to 
guide the HPC user through the many algorithmic considerations, hardware alternatives, usability 
issues, programming languages, and design tools that need to be understood before embarking on 
the creation of reconfigurable parallel codes. 

However, this book is not intended to teach how to use and program FPGAs. In particular, 
this document does not replace the specific documentation provided by the vendors for any of the 
FPGA programming languages and tools. Before attempting to program a device, the reader is 
encouraged to read and study the latest documentation for each of the chosen tools. 

The first chapter begins with a brief discussion of the technology behind FPGA devices. This
discussion covers the basic architecture of an FPGA in terms of logic blocks and interconnects, 
as well as the programming of these devices. Chapter 2 explains how FPGAs are embedded in 
supercomputing architectures and how to create reconfigurable parallel codes. The focus of the 
second chapter is the architecture and functionality of the Cray XD1 reconfigurable supercomputer. 
Chapter 3 presents a series of algorithmic considerations that are necessary to determine when a 
computational code is a good candidate for FPGA acceleration. Chapter 4 offers an overview of 
some of the most widely used FPGA programming languages: VHDL, DSPLogic, Mitrion-C, 
and Handel-C. Chapter 5 provides a detailed study of how to use reconfigurable supercomputing 
techniques to accelerate a variety of data sorting algorithms. Finally, in Chapter 6, we discuss recent 
alternative technologies and summarize our experience with FPGAs.

Introduction


Recent years have witnessed reconfigurable supercomputing spearheading a radically new area 
in high performance computer design. Reconfigurable supercomputers typically consist of a parallel 
architecture where each computational node is made of traditional ASIC CPUs, working alongside a 
Field Programmable Gate Array (FPGA) in a master-slave fashion. Reconfigurable supercomputers 
appear to be of relevance to a variety of codes of interest to the scientific, financial, and defense 
communities. 

The distinctive feature of an FPGA is that its hardware configuration takes place after the 
manufacturing process. This means that the user is no longer limited to a fixed, unchangeable, and 
predetermined set of hardware functions. That is, an FPGA can create a temporary hardware unit that 
specifically conducts the types of operations needed by an application. For improved performance, 
the FPGA is exclusively used to accelerate portions of the code that exhibit a large degree of data 
parallelism, instruction concurrency, and pipelining. 

Therefore, using FPGA technology, supercomputers are able to modify their logic circuits at
runtime. As a consequence, these machines can, in principle, work with a virtually unlimited number of
circuit designs, each best suited to the specific application of interest. In this regard, it is important
to mention that reconfigurable supercomputers have been benchmarked to provide dramatic increases
in computational performance for selected examples as well as increased power efficiency [34].
Specifically, this technology has proved to be more effective for certain applications in the areas
of cryptology, signal analysis, image processing, searching and sorting, and bioinformatics.

The co-founders of Xilinx, Ross Freeman and Bernard Vonderschmitt, invented the FPGA
in 1985. Subsequently, Steve Casselman of the US Naval Surface Warfare Center created the first 
design of a reconfigurable computer implementing 600,000 reprogrammable gates during the late 1980s [43].
Hence, reconfigurable supercomputing using FPGAs is a relatively new technology. 

Reconfigurable supercomputing poses many challenges. In particular, parallel programming is 
not trivial even without FPGAs. Because of the reconfigurable nature of FPGAs, programmers must 
deal with a variety of hardware and software considerations while designing efficient reconfigurable 
codes. To date, there is no compiling option that automatically enables FPGA acceleration. 

In general, the goal for the software designer of reconfigurable applications is to exploit 
the specific benefits, flexibility, and efficiency of this technology while avoiding its many known 
shortcomings. Integrating FPGAs in system design in a way that reduces the overhead of using 
them will significantly advance the technology. Features such as shared memory are needed to reduce 
communication costs. Either the time needed to load logic needs to be reduced or the possibility of 
switching FPGAs needs to be added to make runtime reconfiguration useful. 




Furthermore, not all parallel codes will benefit from FPGA acceleration. Programmers must
carefully analyze their applications to determine if they are good candidates for FPGA acceleration.
Typically, code implemented on an FPGA is a compute-intensive kernel that is a bottleneck of
the application. Testing involves rewriting major portions of code for both the host processor and the
co-processor. By any measure, reconfigurable software development is a complex and time-consuming
task for the neophyte FPGA programmer.


CHAPTER 1


FPGA Technology 


The FPGA is a digital integrated circuit that can be configured and programmed by the user to 
perform a variety of computational tasks [19]. These units work with integers or floating-point numbers,
perform several types of arithmetic and logical operations, access and reference memory locations, 
and carry out sophisticated control structures involving loops and conditionals. Arguably, the way in 
which FPGAs merge hardware and software is what makes them unique. Consequently, the design 
of a reconfigurable algorithm will involve a variety of software and hardware considerations. Indeed, 
the programmer needs to understand not only the algorithmic structure of the problem to be solved 
but also the important details about the configuration of the hardware for optimal performance. 


1.1 ASIC VS. FPGA


To better appreciate the differences between programming traditional devices and reconfigurable 
units, let us think about the differences between traditional hardware and FPGAs. With traditional 
Application Specific Integrated Circuits (ASIC), we can design, compile, and run an arbitrary com- 
puter program to perform an arbitrary computation. However, the translation to machine executable 
controls, made by the compiler, is restricted to the existing operations available inside the CPU. 

On the other hand, the hardware configuration of an FPGA takes place after the manufac- 
turing process, which means that programming is no longer limited to a fixed, unchangeable, and 
predetermined set of hardware functions. Hence, an FPGA becomes a temporary hardware unit that 
specifically conducts the types of operations needed by an application. This capability justifies the
term "field programmable" in the name FPGA.

Indeed, a traditional ASIC CPU has a fixed and determined number of integer arithmetic 
units and floating point units, which may be in use or idle at any moment in time, depending on 
the specific mathematical or logical operation being carried out. Clearly, ASIC CPUs are designed 
for flexibility to tackle a wide variety of computational problems. Yet, these CPUs present many 
challenges to efficiently utilize their hardware resources. 

In particular, the architecture of the ASIC CPU may cause a series of delays due to hardware 
conflicts. In other words, an operation may not be carried out until a hardware unit becomes available. 
However, an FPGA can, in principle, be programmed in such a way that all of its configured hardware
resources will be available at all times. Furthermore, FPGAs offer the potential of an extremely high 
level of utilization of hardware resources. 

For example, let us suppose that we have an application that extensively uses integer additions
and not much else. If we use a traditional ASIC CPU, then our application is basically restricted
to the number of integer arithmetic adders available while multipliers and other hardware functions
remain idle for most of the computational time. On the other hand, if we program this application
using an FPGA, then we can configure the device to consist mainly of integer arithmetic adders.
Then, nearly all of the hardware resources available inside the FPGA will be busy most of the time.


1.2 PHYSICAL ARCHITECTURE


In a nutshell, an FPGA is a semiconductor device made of a grid of programmable logic
components (logic blocks) tied together by means of programmable interconnects [10]. The logic blocks
can implement combinational logic and contain flip-flops for sequential logic operations. About 10% of
the FPGA is made of logic blocks.

The other 90% of the FPGA is made of the programmable interconnects, which form a
routing architecture that provides arbitrary wiring between the logic blocks. A simplified abstract
view of an FPGA is shown in Figure 1.1. The square boxes represent the logic blocks and the lines
correspond to the interconnect structure.

Figure 1.1: Abstract view of an FPGA. The square boxes represent the logic blocks (LB) and the straight
lines correspond to the interconnect structure.


1.3 LOGIC BLOCKS


As mentioned before, the logic blocks inside an FPGA can be programmed to perform a variety of 
functions. In particular, they can be used to carry out the following functions: 


* A standard set of logical gates such as AND and XOR. 




* Complex combinational mathematical functions. 


* High level arithmetic and control structures such as integer multipliers, counters, decoders, 
and many more. 


The great flexibility of the logic blocks is accomplished through the extensive use of Look Up 
Tables (LUTs). The structure of these LUTs is replicated multiple times over the area of the FPGA. 
As such, LUTs are the basic building block of the logical structure of an FPGA. 

As their name suggests, LUTs are simply hardware implementations of logical truth tables.
Furthermore, truth tables can be used to represent arbitrary Boolean functions. As such, FPGAs
can be programmed to carry out sophisticated algorithms by using truth tables that represent the
Boolean functions relevant to the program.

As an example, let us consider the case of a very simple Boolean function, the 3-bit XOR. This
function is represented by f(a,b,c) = a XOR b XOR c, whose output depends on the eight possible
inputs as shown in the truth table in Table 1.1.

An abstract diagrammatic representation of the hardware implementation of the 3-bit XOR
as an LUT can be seen in Figure 1.2. The crossed vertical line in the lower part of the figure denotes
an input of 3 bits. Depending on the value of the 3-bit input, the LUT selects the value of the
function from the column at the left-hand side of the figure. That is, if the input is 000, then the
LUT outputs 0, the value stored in the column at address 000; if the input is 001, then the LUT
outputs 1, the value stored at address 001; and so on.
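
The lookup behavior described above can be sketched in a few lines of Python. This is only an illustrative software model, not vendor code: the names build_lut and lut_read are hypothetical, and we assume that the first input (a) is the most significant bit of the address.

```python
from itertools import product

def build_lut(func, n):
    """Tabulate an n-input Boolean function as a list of 2**n output bits.

    Entry i holds the output for the input whose bits, read as a binary
    number (first input = most significant bit), equal i.
    """
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n)]

def lut_read(lut, *bits):
    """Look up the stored output for the given input bits."""
    address = 0
    for b in bits:
        address = (address << 1) | b
    return lut[address]

# 3-bit XOR: f(a, b, c) = a XOR b XOR c
xor3_lut = build_lut(lambda a, b, c: a ^ b ^ c, 3)

# Print the truth table, one row per address.
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "->", lut_read(xor3_lut, a, b, c))
```

Note that once the table has been filled in, the function itself is never evaluated again; every later call is a pure memory lookup, which is the essence of an LUT.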

Those familiar with logical circuits will appreciate that the LUT is being implemented as a
multiplexer unit. A multiplexer is a device that selects one of many input signals and sends it
as a single output signal. In general, a multiplexer with 2^n available inputs requires a selector made of
n bits. The value of the selector is the address of the input being sent to the output of the multiplexer.
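
To make the relation between the selector width and the number of inputs concrete, here is a minimal Python sketch of a multiplexer. The function name multiplexer and the bit ordering (most significant selector bit first) are assumptions made for illustration only.

```python
def multiplexer(inputs, selector_bits):
    """Return the input addressed by the selector (most significant bit first)."""
    n = len(selector_bits)
    if n == 0 or len(inputs) != 2 ** n:
        raise ValueError("a selector of n >= 1 bits addresses exactly 2**n inputs")
    address = int("".join(str(b) for b in selector_bits), 2)
    return inputs[address]

# Eight data inputs holding the 3-bit XOR truth table; the 3-bit selector
# plays the role of the LUT input (a, b, c).
data_inputs = [0, 1, 1, 0, 1, 0, 0, 1]

# Selector 101 is address 5, so the multiplexer forwards data_inputs[5].
result = multiplexer(data_inputs, (1, 0, 1))
print(result)
```

In this picture the data inputs are the stored configuration and the selector is the run-time input, which is why programming an LUT amounts to choosing the constants wired to the multiplexer inputs.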

Clearly, an arbitrary Boolean function can be computed with an LUT. However, the LUT
has to be large enough to fully describe all of the possible inputs of the Boolean function. A simple
logical gate such as a two-input XOR has four possible inputs described by a 2-bit input signal, but a more
sophisticated Boolean function operating on 32-bit integers will permit 2^32 inputs described by a
32-bit input signal. In general, an n-LUT is an LUT that selects one out of 2^n input signals using
an n-bit selector signal.

Figure 1.2: Diagrammatic representation of the hardware implementation of the 3-bit XOR as an LUT.

In the context of FPGA design, there are advantages and disadvantages of using large or 
small LUTs. For instance, FPGAs with large LUTs permit the easy description of complex logical 
operations and require less wiring in the interconnect. On the other hand, FPGAs with large LUTs 
may yield sluggish performance because they require larger multiplexers. Also, they may end up 
wasting valuable hardware resources if simple Boolean functions are programmed inside a large 
LUT. 

If the FPGA uses small LUTs, then the programming of a Boolean function may require a
large number of logical blocks and extensive wiring in the interconnect. Furthermore, the extensive
wiring between blocks may be a cause of delay, which may lead to slower computational performance.

Therefore, LUTs inside an FPGA cannot be too big or too small. To date, it appears that
the best trade-off between the size of the LUT and the required wiring in
the interconnect is a 4-LUT. Nevertheless, recent FPGA architectures such as the Virtex-5 use
6-LUTs. Furthermore, practical FPGA designs often group more than a single 4-LUT together.
In general, an FPGA slice is usually made of a pair of 4-LUTs or 6-LUTs.
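
The trade-off between LUT size and wiring can be illustrated with a rough count. The sketch below, written under simplifying assumptions, estimates how many k-input LUTs and how many logic levels are needed to XOR a given number of bits by grouping signals level by level. The helper names are hypothetical, and the grouping strategy is a simple one that is not claimed to be what any synthesis tool does.

```python
def lut_config_bits(n):
    """Number of stored output bits in a single n-input LUT."""
    return 2 ** n

def xor_tree(num_inputs, k):
    """Count the k-input LUTs and logic levels needed to XOR num_inputs bits.

    Each level groups the remaining signals k at a time. A leftover group
    of a single signal is passed through without using a LUT.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    luts, levels, signals = 0, 0, num_inputs
    while signals > 1:
        full_groups, leftover = divmod(signals, k)
        luts += full_groups + (1 if leftover > 1 else 0)
        signals = full_groups + (1 if leftover > 0 else 0)
        levels += 1
    return luts, levels

for k in (2, 4, 6, 16):
    luts, levels = xor_tree(16, k)
    print(k, luts, levels, luts * lut_config_bits(k))
```

Under these assumptions, a 16-input XOR needs five 4-LUTs arranged in two levels (80 stored bits in total), whereas a single 16-LUT would need one level but 65,536 stored bits, and 2-LUTs would need fifteen blocks spread over four levels of wiring.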

It is important to note that the abstract representation given in Figure 1.2 is not completely
accurate. The multiplexer does not store state information within the logic block. Hence, it is
impossible to use it to perform any type of sequential or state-holding logical operations. The solution
to this deficiency is simple. As shown in Figure 1.3, the multiplexer is paired with a single-bit storage element
implemented with a D flip-flop. This flip-flop simply stores the result of the 4-LUT until the next
clock signal.


Figure 1.3: Logic block design, including a D Flip-Flop unit for single-bit storage.
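
A toy software model may help to see why the flip-flop enables sequential logic. In the sketch below, the class LogicBlock and its method names are hypothetical, and we assume that a one-bit selector chooses between the combinational LUT output and the registered value.

```python
class LogicBlock:
    """Toy model of a logic block: a LUT followed by a D flip-flop.

    use_register selects between the combinational LUT output and the
    value stored in the flip-flop (an assumption made for illustration).
    """

    def __init__(self, lut_bits, use_register, initial_state=0):
        self.lut_bits = list(lut_bits)
        self.use_register = use_register
        self.state = initial_state

    def lut_output(self, address):
        return self.lut_bits[address]

    def output(self, address):
        return self.state if self.use_register else self.lut_output(address)

    def clock(self, address):
        """On a clock signal the flip-flop captures the current LUT output."""
        self.state = self.lut_output(address)

# A 4-LUT programmed as an inverter of its least significant input bit.
inverter_lut = [1 - (address & 1) for address in range(16)]
block = LogicBlock(inverter_lut, use_register=True, initial_state=0)

# Feed the stored bit back into the LUT: the block toggles on every clock.
sequence = []
for _ in range(4):
    block.clock(block.state)
    sequence.append(block.output(0))
print(sequence)
```

Without the stored bit the block could only compute a function of its present inputs; with it, the output depends on the history of clock signals, which is what a counter or a state machine requires.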


1.4 THE INTERCONNECT


As previously discussed, the interconnect is configured to properly route all of the signals between 
the logic blocks inside an FPGA. In general, large computations are broken into simpler operations 
that encompass several logic blocks. Then, the interconnect is used to gather the results of the 
individual computations that were performed by all the LUTs involved in the computation. An 
FPGA can perform arbitrary computations as long as the logic blocks and the interconnect are large 
enough to fully describe all of the required logical operations. There are several ways to structure the 
interconnect: 


* Nearest Neighbor. 
* Segmented. 
* Hierarchical. 


The simplest of all are the nearest neighbor structures. As their name suggests, nearest neighbor
interconnect structures only connect logic blocks to the nearest logic blocks. Unfortunately, this
structure often leads to severe delays in passing information from one side of the FPGA to another
(a delay that scales linearly with the distance required to be traversed) and also incurs a wide variety
of connectivity issues (signals passing through a logic block should not interfere with new signals
being produced inside the logic block being traversed).

Segmented structures are a more sophisticated way to implement the interconnect. In addition
to LUTs, these structures rely on connection blocks and switch boxes to provide increased routing
flexibility. Finally, hierarchical structures group together logic blocks in a segmented structure
arranged in a hierarchical manner. In practice, most modern FPGA architectures implement a
variant of segmented or hierarchical interconnect structures.

Figure 1.4: Nearest Neighbor, Segmented and Hierarchical interconnect structures. Squares represent
logic blocks, lines the interconnect, and the triangles are connection and switching boxes.


1.5 MEMORY AND I/O


While the basic components of an FPGA are the logic blocks and the interconnect, most commercial
FPGAs have a slightly more sophisticated architecture. New components need to be added to the
architecture in order to properly connect the FPGA within the framework of a computer system. As
mentioned in the introduction to this lecture, the FPGAs in reconfigurable computing often work
under the direction of a host CPU in a master-slave fashion.

First of all, FPGAs require access to local memory resources in the form of local
RAM. In this context, local memories in the FPGAs are conceptually equivalent to a cache
in a traditional ASIC CPU. Second, an FPGA requires ports for input and output (I/O). These I/O
ports connect the FPGA to a host CPU and to external memories, and they convey both the programming
of the FPGA and the numerical inputs required for computations.

Even more sophisticated FPGA architectures are currently available in the market. For instance,
FPGAs may include processor blocks and functional components that directly provide
sophisticated arithmetic functions that are commonly used to solve computational problems. In
particular, it is not rare to see in modern FPGAs the inclusion of integer multipliers.


1.6 RECONFIGURABLE PROGRAMMING AND FPGA CONFIGURATION


Clearly, reconfigurable programming requires the FPGA to be configured to perform the desired 
computations. Thus, a reconfigurable program is executed by a traditional CPU, which configures and 
harnesses an FPGA. In this regard, every single logic block and routing path in the interconnect 
found inside the FPGA can be controlled by using a sequence of memory bits. An FPGA program 
may be viewed as a binary file with a series of memory bits that are used to configure, bit by bit, 
the logic blocks and interconnects inside the FPGA. The FPGA program is often referred to as the 
bitstream, and it is only a part of a reconfigurable program. 

Let us first consider the case of the logic blocks. If we wish to program a single logic block 
inside an FPGA, then we need to specify all of the programmable points in the architecture of the 
logic block. By looking at Figure 1.3, we can identify these components and how many programming 
bits are required: 


* We require 16 bits to program the 2^4 possible output values of the 4-LUT.
* The select signal of the multiplexer requires 1 bit. 
* The initial state of the D flip-flop requires 1 bit. 


Therefore, the programming of a single 4-LUT logic block inside an FPGA requires at least 
18 bits. In most practical applications, SRAM (Static Random Access Memory) bits are connected 
to these programmable points and provide the means to program the 4-LUT logic block. That is, 
the program for a single 4-LUT logic block is a sequence of 18 bits that uniquely specify all of its 
programmable points. 
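
As a minimal sketch of what such an 18-bit program might look like, the following Python code packs the three groups of bits into a single word. The bit layout and the function names are assumptions made purely for illustration; real devices define their own, vendor-specific bitstream formats.

```python
def pack_logic_block(lut_bits, mux_select, ff_init):
    """Pack a 4-LUT logic block configuration into an 18-bit integer.

    Assumed layout (illustrative only): bits 0-15 hold the LUT outputs for
    addresses 0-15, bit 16 is the multiplexer select, and bit 17 is the
    initial state of the D flip-flop.
    """
    if len(lut_bits) != 16:
        raise ValueError("a 4-LUT stores exactly 2**4 = 16 output bits")
    word = 0
    for address, bit in enumerate(lut_bits):
        word |= (bit & 1) << address
    word |= (mux_select & 1) << 16
    word |= (ff_init & 1) << 17
    return word

def unpack_logic_block(word):
    """Recover (lut_bits, mux_select, ff_init) from an 18-bit word."""
    lut_bits = [(word >> address) & 1 for address in range(16)]
    return lut_bits, (word >> 16) & 1, (word >> 17) & 1

# Program the 4-LUT as a 4-input XOR (parity of the address bits).
parity_lut = [bin(address).count("1") & 1 for address in range(16)]
config = pack_logic_block(parity_lut, mux_select=1, ff_init=0)
print(format(config, "018b"))
```

A complete bitstream can be thought of as a long concatenation of such words, one per logic block, together with the bits that configure the interconnect.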

Let us now consider the interconnect. In order to program the interconnect of an FPGA, each
and every single switch point in the interconnect structure has to be defined. That is, the programming
of the interconnect describes how all the logic blocks are routed and connected among themselves.
As in the case of the logic blocks, SRAM bits are commonly used to configure the wiring paths of
the interconnect.

From a user's perspective, the programming of an FPGA is similar to the programming of 
a traditional CPU. That is, through a compilation stage an algorithmic description is ultimately 
transformed into an executable file. However, the process required to program an FPGA is slightly
more sophisticated, involving six basic steps:


* Functional Description.
* Logic Synthesis.
* Technology Mapping.
* Placement.
* Routing.
* Bitstream Generation.


A brief description of what is involved in each of these steps is given. 


1.6.1 FUNCTIONAL DESCRIPTION 

The desired functionality of an FPGA is described using a Hardware Description Language 
(HDL) [7]. The most important examples of these languages are VHDL and Verilog (VHDL 
is discussed in slightly more detail in a subsequent chapter). HDLs are languages whose syntax and
semantic structure resemble those of traditional programming languages such as FORTRAN or C.
They are used to provide a high level logical description of the algorithm, without invoking specific 
gates or hardware components. Specifically, HDLs describe the intended behavior of the FPGA 
under specific inputs. 


1.6.2 LOGIC SYNTHESIS 

As its name suggests, this step transforms the behavioral description of the FPGA provided by the
HDL into a list of interconnected logic gates. The output of this process is often referred to as an
FPGA netlist. Such a logic synthesis is not unique. For example, we could have fast logical circuits
that require a large number of gates or smaller circuits that run at slower rates. In general, faster
circuits require larger area (number of logic blocks). To this end, most of the synthesis tools available
in the market provide the option of optimizing for speed or for area. Clearly, speed and area optimizations
play a decisive role in determining the topology of the logical components.


1.6.3 TECHNOLOGY MAPPING 

In this step, the individual logic gates described by an FPGA netlist are separated and grouped together
into an architecture that best matches the resources of the specific FPGA under consideration [11].
For example, if a simple netlist describes a logical circuit made of two XOR gates and two AND gates,
then the technology mapping process determines how to group these four gates in a way that best
matches the logical resources available inside the FPGA. As a consequence, the technology mapping
process is specific to the type of FPGA that will be used to run the reconfigurable application
(logic synthesis, in contrast, is a technology-independent process). Technology mapping tools
may be targeted to optimize area, timing, power, or interconnectivity.


1.6.4 PLACEMENT

In this step, all of the groups of logic gates determined by the technology mapping process are assigned
to the specific logic blocks available in the target FPGA [3, 28]. Placement is a challenging aspect of
FPGA programming due to the factorial growth in the number of possible placements: If an FPGA has n logic
blocks, then there are n! possible placements. Furthermore, the design has to satisfy some timing
constraints, e.g., on the time that a signal takes to propagate from the input to the output registers.
As n may well be on the order of millions for modern FPGA architectures, an exhaustive search for the
optimal placement is clearly out of the question. Furthermore, it is not always possible to
route a given placement using the available interconnect resources.
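
To get a feeling for how quickly n! grows, the following short Python sketch counts the decimal digits of n! without computing the factorial itself. The function name placement_count_digits is hypothetical; the calculation relies only on the identity ln(n!) = lgamma(n + 1).

```python
import math

def placement_count_digits(n):
    """Number of decimal digits in n!, computed without forming n! itself."""
    # lgamma(n + 1) equals ln(n!), so log10(n!) = lgamma(n + 1) / ln(10).
    return math.floor(math.lgamma(n + 1) / math.log(10)) + 1

for n in (10, 100, 1000, 1_000_000):
    print(n, placement_count_digits(n))
```

Even for n = 100 the number of placements has more than 150 digits, so any practical placement tool has to rely on heuristics rather than enumeration.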


1.6.5 ROUTING

Once the placement mapping has been determined, the routing determines the optimal interconnec- 
tion of the logic blocks using the available resources of the interconnect. Simulated annealing [56] 
and partitioning [57] are among the algorithms used by vendors such as Altera and Xilinx. 
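
To give a flavor of how simulated annealing works, the sketch below applies it to a toy placement problem rather than to routing proper: nine blocks on a 3x3 grid are swapped at random so as to reduce the total Manhattan wire length. Everything here (the cost function, the cooling schedule, the parameter values, and the function names) is an assumption chosen for illustration and does not describe the algorithms of any vendor tool.

```python
import math
import random

def wire_length(placement, nets):
    """Total Manhattan distance over all two-point nets."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total

def anneal(placement, nets, steps=5000, t_start=5.0, t_end=0.01, seed=0):
    """Return the best placement found and its cost."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = wire_length(current, nets)
    best, best_cost = dict(current), cost
    blocks = list(current)
    for step in range(steps):
        # Geometric cooling schedule from t_start down toward t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        a, b = rng.sample(blocks, 2)
        current[a], current[b] = current[b], current[a]  # swap two blocks
        new_cost = wire_length(current, nets)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(current), cost
        else:
            current[a], current[b] = current[b], current[a]  # undo the swap
    return best, best_cost

# Nine blocks connected in a chain 0-1-2-...-8, placed at random on a 3x3 grid.
sites = [(x, y) for x in range(3) for y in range(3)]
shuffled = sites[:]
random.Random(1).shuffle(shuffled)
initial = {block: site for block, site in enumerate(shuffled)}
nets = [(i, i + 1) for i in range(8)]

best, best_cost = anneal(initial, nets)
print(wire_length(initial, nets), "->", best_cost)
```

The occasional acceptance of a worse configuration is what allows the search to escape local minima, at the price of many cost evaluations; this is one reason why placement and routing take so long for large designs.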


1.6.6 BITSTREAM GENERATION 

The final stage generates a binary file that configures all the FPGA logic blocks and interconnects [17].
The resulting bitstream can then be loaded into the FPGA. The bitstream configures the
FPGA based on the behavioral specifications that were originally described by the HDL. As such,
it is convenient to think of the bitstream as the executable file of the FPGA program and the HDL
as the source code.


From a user's perspective, given an HDL description, the synthesis of the bitstream is often reduced
to a single operation. To date, there are a variety of FPGA vendors that offer sophisticated tools that
automatically perform the logic synthesis, technology mapping, placement, routing, and bitstream
generation using a single command. Most of these tools are called using a makefile that is based on
a directory tree structure.

Unfortunately, logic optimization, technology mapping, placement, and routing are NP-hard
problems, and they remain expensive even after extensive use of optimizations and heuristic methods.
Indeed, the optimization of a superlinearly large number of combinations requires a superlinearly
large amount of time. Consequently, bitstream synthesis is a time-consuming process.
For example, a simple design that performs vector addition of multidimensional vectors may require
dozens of minutes for the synthesis of the bitstream to be completed.


1.8 THE XILINX FPGA


Before concluding this chapter, we discuss one of the most popular FPGAs currently available in
the market. Xilinx is one of the largest companies that manufacture FPGAs. In addition, Xilinx also
develops software and tools to program and harness its FPGAs. Xilinx offers two basic types of
FPGAs: the Virtex series for high performance applications and the lower-performance, but
price-friendly, Spartan FPGAs. An article to appear in Scientific Computing provides several comparative
tables to address costs of using FPGAs [53].

Most reconfigurable supercomputers, like those manufactured by Cray and Silicon Graphics 
International (SGI), use FPGAs from the Xilinx Virtex family. In addition to the standard FPGA 
logic fabric, the Xilinx FPGAs feature embedded hardware functionality to carry out some of the 
most commonly used functions in computational applications. These devices possess a number of 
integer arithmetic multipliers, adders, and memories. The amount of local memory available on a 
Xilinx FPGA depends on the specific model and ranges from a few kilobits to tens of megabits. 

Xilinx also provides the Xilinx Synthesis Technology (xst) suite of tools that perform synthesis
of HDL designs and create Xilinx-specific bitstreams. The xst tools support both VHDL and
Verilog. Furthermore, Xilinx also provides ModelSim, a rather robust HDL simulation environment
that is extremely useful to verify the functional and timing models of the design before committing
resources to the bitstream synthesis.


1.9 SUMMARY 


The FPGA is a digital integrated circuit that can be configured and programmed by the user to
perform a variety of computational tasks. As we discussed, the basic components of an FPGA are
the logic blocks and the interconnects. FPGA programming requires a sophisticated six-step process:
functional description of the code, logic synthesis, technology mapping, placement, routing, and
bitstream generation. Clearly, reconfigurable programming is much more complicated and
time-consuming than the standard programming used for traditional ASIC architectures.

CHAPTER 2


Reconfigurable Supercomputing 


Large-scale scientific codes that use reconfigurable hardware and run on supercomputers are often
called reconfigurable codes. In most cases, only portions of the code are implemented on an FPGA while
the rest runs on a traditional CPU. The code implemented on an FPGA is usually a computationally 
intense kernel that is the bottleneck of the application. 

In this sense, the FPGA can be considered more as an application accelerator, like a GPU
(graphics processing unit), than as a replacement for a CPU. A programmer must determine the code
segments that are good candidates for FPGA acceleration using such tools as the Reconfigurable 
Amenability Test (RAT) [54, 55]. This determination should take into account the communication 
costs between the host CPU and the FPGA application acceleration processor. 
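
The effect of communication costs can be estimated with a simple back-of-the-envelope model. The sketch below is not the RAT methodology itself; it is a deliberately simplified model with hypothetical names and made-up example numbers, in which the accelerated runtime is the untouched CPU time, plus the accelerated kernel time, plus the time spent moving data between the host and the FPGA.

```python
def estimated_speedup(t_total, kernel_fraction, kernel_speedup,
                      bytes_moved, bandwidth_bytes_per_s):
    """Estimate the overall speedup when one kernel is offloaded to an FPGA.

    t_total               : runtime of the all-CPU code, in seconds
    kernel_fraction       : fraction of t_total spent in the kernel (0 to 1)
    kernel_speedup        : how much faster the kernel runs on the FPGA
    bytes_moved           : data transferred between host and FPGA, in bytes
    bandwidth_bytes_per_s : sustained host-FPGA bandwidth, in bytes per second
    """
    t_kernel = t_total * kernel_fraction
    t_rest = t_total - t_kernel
    t_comm = bytes_moved / bandwidth_bytes_per_s
    t_accelerated = t_rest + t_kernel / kernel_speedup + t_comm
    return t_total / t_accelerated

# Hypothetical example: a 100 s code spends 90% of its time in a kernel
# that runs 20 times faster on the FPGA.
no_transfer = estimated_speedup(100.0, 0.9, 20.0, 0.0, 1.6e9)
with_transfer = estimated_speedup(100.0, 0.9, 20.0, 16e9, 1.6e9)
print(round(no_transfer, 2), round(with_transfer, 2))
```

In this made-up example, moving 16 GB at 1.6 GB/s adds 10 s of transfer time and cuts the overall speedup from roughly 6.9 to roughly 4.1, which shows why a kernel that looks attractive in isolation may not be a good candidate once data movement is taken into account.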

Reconfigurable supercomputing combines the strengths of reconfigurable boards with high 
performance computing [14]. Each node consists of traditional CPUs combined with a high per- 
formance FPGA device. A Message Passing Interface (MPI) [44] parallel code can be designed so 
that an MPI process associated with each node manages its own FPGA. Codes can be run using 
multiple CPUs and multiple FPGAs. 

Reconfigurable supercomputers continue to be developed by mainstream companies including 
Cray, SGI (now Rackable) and SRC Computers. An SRC-7 cluster system is installed at Jackson 
State University (JSU) in support of a joint research project between JSU and the U.S. Army Engineer 
Research and Development Center (ERDC). The High Performance Embedded Reconfigurable 
Computing (HPERC) market, spurred on by military and space applications, attracts such companies 
as Nallatech, DRC, and XtremeData. Naturally, various software tools continue to be developed by 
both hardware vendors such as Altera and Xilinx as well as Intellectual Property (IP) companies such 
as Mitrionics (Mitrion-C) and Portland Group (PGI) for high-level programming. In particular, 
PGI is developing compilers that support the Compute Unified Device Architecture (CUDA) language
to program Graphics Processing Units (GPUs). Reconfigurable codes can be built for each system,
but are not portable across platforms from different vendors. The Cray XT5 hybrid supercomputer 
with FPGA accelerators was introduced in November 2007. We will concentrate our discussion on 
the architecture and functionality of Naval Research Laboratory's (NRL) Cray XD1 introduced in 
2004. 

This chapter briefly covers the architecture of the NRL Cray XD1, with particular emphasis
on the way the FPGAs are connected to the host processors [20]. How this architecture plays a
role in the algorithmic design of the applications that use FPGA acceleration is also discussed. In
addition, we will discuss some basic information necessary to understand the FPGA API
used to establish communication between the host processor and the FPGA.

The NRL Cray XD1 consists of 36 chassis with six nodes in each chassis. Each node of the NRL
system consists of two AMD Opteron 275 2.2 GHz dual core processors with 8 GB of shared
memory. Altogether, there are 216 nodes, of which only 150 have an FPGA: 144 nodes have Xilinx
Virtex II Pro FPGAs and six have Virtex-4 FPGAs. With this configuration, the NRL Cray XD1 is the
largest Cray reconfigurable supercomputer in the world.

The Virtex family of FPGAs is manufactured by Xilinx and, in addition to the standard
FPGA logic fabric, features embedded hardware for commonly used functions such
as multipliers and memories. The amount of block RAM depends on the model and ranges from
a few kilobits into tens of megabits. The Virtex-4 FPGA is larger, and its functionality
exceeds that of the Virtex II Pro. The Virtex II Pro (part number xc2vp50-ff1152-7) has a clock
frequency of 420 MHz, 23.6 K slices, 4 Mb of block RAM and 232 Multiplier Blocks (18x18) [35].
The Virtex-4 LX (part number xc4vlx160-ff1148-10) has a clock frequency of 500 MHz, 67.5 K
slices, 5 Mb of block RAM and 288 Multiplier Blocks (18x18) [36].

It is important to note that each of the processors inside a node with an FPGA has access to 
the FPGA inside that node while it has no direct access to an FPGA in a neighboring node. The 
Rapid Array Interconnect (RAT) uses a bi-directional bus to connect the FPGA with the processors 
inside the node. The theoretical bandwidth of this connection is 3.2 GB per second (1.6 GB/s each 
way); however, read operations are much slower at about 10 KB/s due to a flaw in the implementation 
of the Opteron. 

Each of the NRL Cray FPGAs has four QDR (Quad Data Rate) local memories (static
RAMs) of 4 MB, which conceptually are equivalent to the cache memory on a standard processor.
Each of these QDRs can be read and written simultaneously, allowing a total of eight concurrent
memory operations.

The memory bandwidth between the FPGA and its cache is 3.2 GB per second. This large 
bandwidth means that the FPGA could, in principle, be kept busy while the host processor is 
concurrently transferring data to and from the FPGA. While the host processor is filling up one 
buffer in a QDR SRAM, the FPGA can work with data stored in another buffer. Unfortunately,
coordination between the host processor and co-processor is a challenge, and some of the tools do 
not even support data transfers to/from the FPGA while the device is running. 

Because the diverse ways to send data to and from the FPGA are asymmetrical with respect 
to their speed, care needs to be taken when performing I/O operations. Table 2.1 summarizes the 
I/O channels and their speeds. 

Therefore, an efficient reconfigurable application must be designed in such a way that the 
slow I/O paths are not taken. The design must never involve the FPGA reading data directly from 
the host RAM. Instead, the host processor writes any needed data into the FPGA QDR memory. 
Similarly, the host processor should never read results from the FPGA QDRs, and the FPGA should 
write any needed results directly to the host RAM. 
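
To make the asymmetry concrete, the following C sketch estimates how long a buffer takes to move over each path, using the approximate bandwidths listed in Table 2.1. The function and constant names are illustrative assumptions and are not part of the Cray API; decimal units (1 KB = 1000 bytes) are assumed, and latency and protocol overhead are ignored.

```c
#include <assert.h>

/* Approximate bandwidths from Table 2.1, in bytes per second.
   Decimal units are an assumption made only for this illustration. */
#define HOST_WRITE_TO_QDR_BPS   3.2e9    /* host writes into FPGA QDR memory */
#define HOST_READ_FROM_QDR_BPS  10.0e3   /* host reads from FPGA QDR memory  */
#define FPGA_WRITE_TO_HOST_BPS  711.0e6  /* FPGA writes into host RAM        */

/* Estimated time, in seconds, to move `bytes` over a path with the given
   bandwidth. Latency and protocol overhead are ignored. */
double transfer_seconds(double bytes, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second;
}
```

Under these assumptions, a 4,000,000-byte buffer takes about 1.25 ms when the host writes it into QDR memory and about 5.6 ms when the FPGA writes it into host RAM, but about 400 seconds when the host reads it back from QDR memory. This is why the slow paths must be designed out of the application.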

Table 2.1: Approximate Bandwidth of I/O Operations

Source   Destination         Read                     Write
FPGA     Local QDR Memory    4x800 MB/s = 3.2 GB/s    4x800 MB/s = 3.2 GB/s
FPGA     Host RAM            N/A (*)                  711 MB/s
Host     FPGA QDR Memory     10 KB/s (*)              3.2 GB/s
Host     Host RAM            Fast (GB/s)              Fast (GB/s)

(*) = Slow or not available.


After an FPGA program is synthesized, the resulting binary file encapsulates the hardware
configuration of the FPGA (see Section 1.7 on bitstream synthesis for further information about this
process). To load this file into the FPGA, harness the FPGA program, and establish
communication links between the host processor and the FPGA, Cray offers a utility program and an
application programming interface (API) library of functions. Other FPGA programming tools,
such as Mitrion-C and Celoxica, offer their own API libraries, which in turn are wrappers around the
Cray API. Without loss of generality, we will concentrate on the Cray API.

The Cray API is not used to program the logic of the FPGA but merely to harness its 
functionality and interact with the host. The FPGA code knows nothing about the API. In this 
regard, an FPGA works like any other device connected to the system. That is, the API is used to
open the FPGA and get a handle to the device, load the FPGA program, allocate and initialize
memory, execute the FPGA logic, send and retrieve information from the FPGA, and halt the
execution and close the FPGA. All of the API calls are made within a program running on the host
processor.

The Cray API provides various calls in C to transfer data to and from the FPGA, map memory, 
load and unload the FPGA, and open and close the FPGA device [12]. Table 2.2 lists available Cray 
API calls. 

User logic in the FPGA has access to the Opteron memory through the RapidArray Transport 
(RT) core¹. The Cray RT core is written in VHDL. This core provides data transfers from the FPGA
to the Opteron over the RT interconnect. The user logic in the FPGA sends a bus transaction to the 
RT core, which forwards it through the RT fabric to hardware on the Opteron where it becomes a 
read or write transaction to the Opteron DRAM. 

Figure 2.1 shows the physical components along with their address spaces [12]. The FPGA 
memory is accessible via a region of the HyperTransport I/O address space. Specifically, the FPGA 
memory occupies a 128 MB address region in the Rapid Array Processor (RAP). This memory is 
addressable from the application running on the Opteron. In the figure, the Opteron’s I/O space 
contains memory maps for RAP-1 and RAP-2, the latter being the interconnect to the FPGA. 


¹ A core is a “soft” processor embedded in the FPGA logic.

Table 2.2: List of Cray API Calls for Interfacing with an FPGA

fpga_open              Opens an FPGA device
fpga_load              Loads a converted binary logic file into an FPGA
fpga_reset             Places the FPGA user logic into reset
fpga_start             Releases the FPGA user logic from reset
fpga_memmap            Maps a region of the FPGA address space into the application
                       address space
fpga_mem_sync          Forces completion of all outstanding transactions mapped to
                       FPGA memory
fpga_register_ftrmem   Registers a region of application memory for direct access by
                       FPGA
fpga_dereg_ftrmem      Deregisters an FPGA transfer region
fpga_rd_appif_val      Reads a value from the FPGA address space and guarantees
                       access order
fpga_wrt_appif_val     Writes a value to the FPGA address space and guarantees access
                       order
fpga_status            Gets the status of an FPGA device
fpga_unload            Clears the programming of an FPGA
fpga_close             Closes an FPGA device

RAP-2 has a 128 MB region that the FPGA RT-core memory-maps to the FPGA. Data from the 
host can be written to the QDRs, and block and distributed RAM on the FPGA. 

The fpga_memmap, fpga_rd_appif_val and fpga_wrt_appif_val calls allow the host
application to read and write data to the memory of the FPGA. In particular, an fpga_memmap
call sets up an address space, which can be accessed by pointers. The fpga_rd_appif_val and
fpga_wrt_appif_val calls read and write, respectively, single 64-bit values at a time. Certain
conventions are built into the memory map of the FPGA. For example, the QDR SRAM address
space begins at 64 MB, with each QDR address space occupying a contiguous 4 MB region. The
fpga_register_ftrmem call sets up a host address space which the FPGA can read and write.
The minimum size is one memory page and the maximum size is 1 GB. Section 5.7 gives a detailed
example of the Cray API.
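
The memory-map convention just described can be turned into simple offset arithmetic. The C sketch below computes where each QDR region would start inside the mapped FPGA address space. It assumes zero-based QDR numbering, regions laid out in index order, and 1 MB = 2^20 bytes; none of these details is stated in the text, and the names are hypothetical rather than part of the Cray API.

```c
#include <assert.h>
#include <stddef.h>

#define MB               (1024u * 1024u)
#define QDR_BASE_OFFSET  (64u * MB)  /* QDR address space begins at 64 MB */
#define QDR_REGION_SIZE  (4u * MB)   /* each QDR occupies a 4 MB region   */
#define QDR_COUNT        4u          /* four QDR SRAMs per FPGA           */

/* Byte offset, within the mapped FPGA region, of the first byte of QDR
   number `qdr`. Zero-based numbering (0..3) is an assumption. */
size_t qdr_offset(unsigned qdr)
{
    assert(qdr < QDR_COUNT);
    return (size_t)QDR_BASE_OFFSET + (size_t)qdr * QDR_REGION_SIZE;
}
```

If `base` were the pointer obtained from mapping the FPGA region into the application address space, then `(char *)base + qdr_offset(2)` would address the start of the third QDR under these assumptions. All four regions end at 80 MB, well inside the 128 MB region.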

Figure 2.1: Physical components and Address Spaces of the Cray XD1 Opteron and FPGA.




CHAPTER 3 


Algorithmic Considerations 


By combining software and hardware in a sophisticated architecture, FPGAs offer the prospect of 
increased performance for a variety of computational applications. Currently, a variety of applications 
in the areas of cryptography, bioinformatics, signal analysis, and image processing have benefitted 
enormously from the use of an FPGA as a computation acceleration device [34]. 

Unfortunately, there is no compiler flag that automatically enables FPGA acceleration. A
programmer has the entire burden of designing and rewriting code to make it reconfigurable. In
addition, there is no way to predict, at the design stage, if a reconfigurable code will exhibit a dramatic
increase in performance. As a matter of fact, it may well happen that the reconfigurable code is slower
than the original CPU implementation. Indeed, not every computational code will be a good candidate
for FPGA acceleration.

Therefore, a software developer must follow a series of algorithmic considerations, timing 
analyses, and benchmarking experiments to determine whether or not an implementation is a good 
candidate for FPGA acceleration [14, 15, 53, 57]. Furthermore, because reconfigurable supercom- 
puting is a time consuming endeavor, it is crucial to determine if a specific code is worth the effort 
at an early stage. 


3.1 DISADVANTAGES


Even though it has been shown that FPGAs offer huge performance advantages for a number of 
computational applications, this technology also imposes a series of unique restrictions and limita- 
tions. This should not come as a surprise. FPGAs are not nearly as sophisticated as ASIC CPUs. 
As a consequence, a programmer has to determine the best way to leverage the advantages and 
disadvantages offered by reconfigurable supercomputing. 

The main disadvantages of FPGAs concern their speed, data transfer overheads, data types,
and the level of effort required, which we discuss next.


Speed 

Currently available FPGAs are usually much slower than most ASIC CPUs. For example,
while most modern CPUs run at clock rates in the range of 3 GHz, FPGAs run on the order
of 350 MHz. Thus, FPGAs are an order of magnitude slower than ASIC CPUs. Nevertheless,
FPGAs are becoming faster due to fabrication improvements. Importantly, having a slower clock
allows FPGAs to run using approximately 1/10th of the power and cooling requirements of their
ASIC counterparts [58].

Data Transfer Overheads

When programming reconfigurable codes, the memories of the processors and coprocessors 
are rarely shared. Hence, it is normally necessary to transfer data from the memory of the CPU to 
the local memory of the FPGA, which delays computations because the coprocessor must wait for 
the transfer to complete. Similarly, the FPGA device needs to write the results to the memory of 
the host processor, which can be done concurrently while the coprocessor is computing other results. 
Depending on the specific architecture of the reconfigurable computer, this movement of data to 
and from the FPGA may take a significant amount of time. Loading the logic onto the device takes
by far the most time. As a consequence, this overhead could outweigh any improvement
in the computational performance provided by the FPGA.


Data Types 

Even though floating point operations are becoming more feasible to implement in modern
FPGAs, the most efficient implementations use integers or fixed point numbers [32].
Indeed, a small number of floating point variables may be enough to consume most
of the logic blocks available in the FPGA. Increasing the bit width adds logic over most of the area
of the chip, as numerous LUTs are needed.
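
As a brief illustration of what a fixed point representation looks like, the following C sketch stores a number in 32 bits with 16 integer bits and 16 fractional bits, so that multiplication reduces to an integer multiply followed by a rescale. The format choice (often called Q16.16) and the helper names are assumptions made for this example; they are not taken from any FPGA tool.

```c
#include <stdint.h>

typedef int32_t q16_16;              /* 16 integer bits, 16 fractional bits */
#define Q16_ONE ((q16_16)1 << 16)    /* the value 1.0 in this format        */

/* Convert a small integer to fixed point. Values outside the range
   -32768..32767 do not fit in the 16 integer bits and would overflow. */
q16_16 q16_from_int(int32_t x)
{
    return (q16_16)(x * Q16_ONE);
}

/* Multiply two fixed point values. The product is formed in 64 bits so the
   intermediate result cannot overflow, then scaled back by 2^16.
   Division (rather than a shift) keeps the behavior well defined for
   negative values; the result truncates toward zero. */
q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) / Q16_ONE);
}
```

Only integer adders and multipliers are needed for this arithmetic, which is the property that makes such representations attractive when logic area is scarce. The price is a fixed range and precision that the programmer must verify are adequate for the application.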


Level of Effort 

Reconfigurable computing using FPGAs requires a large investment in time for design,
programming, simulation, testing and debugging. There is a learning curve for any tool. The tools,
which are themselves new technologies, must adapt to the rapid changes in FPGAs and the systems
which use them. Because reconfigurable supercomputers are complex, many different software tools
are needed. Although these divergent tools largely work well together, none of them relieves the
programmer of much of the hard work, especially in the area of design.


3.2 DATA PARALLELISM, PIPELINING, AND CONCURRENCY


As mentioned in the previous section, FPGAs suffer from a series of disadvantages. However, it is
important to understand that the real advantage of FPGAs is that, depending on the application,
they can be configured to exploit massive amounts of parallelism. Ignoring the time
needed to load the logic and the overhead of data transfers, if an FPGA clock is ten times slower than a
CPU clock, then the FPGA needs to perform at least ten times more work per clock cycle than the CPU
for the two to reach comparable performance. To be worth the effort, it is desirable that the FPGA
perform at least 100 times more work per cycle. In such a case, an implementation ported
to a reconfigurable architecture is accelerated by a factor of ten, which justifies the time and resources
spent in software development.
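
The break-even argument above is simple arithmetic, sketched here in C. The model assumes that performance is the product of clock rate and work done per cycle, and, as in the text, it ignores configuration time and data transfer overheads. The function names are illustrative.

```c
#include <assert.h>

/* How many times more work per cycle the FPGA must do than the CPU
   merely to match it, given the two clock rates. */
double break_even_work_ratio(double cpu_clock_hz, double fpga_clock_hz)
{
    assert(fpga_clock_hz > 0.0);
    return cpu_clock_hz / fpga_clock_hz;
}

/* Estimated speedup when the FPGA performs `work_per_cycle_ratio` times
   more work per cycle than the CPU. Values below 1.0 mean a slowdown. */
double estimated_speedup(double cpu_clock_hz, double fpga_clock_hz,
                         double work_per_cycle_ratio)
{
    return work_per_cycle_ratio
           / break_even_work_ratio(cpu_clock_hz, fpga_clock_hz);
}
```

With a 3 GHz CPU and a 300 MHz FPGA, the break-even ratio is 10 and a design doing 100 times more work per cycle yields a speedup of 10. With the 350 MHz figure quoted in Section 3.1, the break-even ratio is roughly 8.6.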

However, an increase in performance by a factor of 100 is only possible if the original appli- 
cation features a high degree of data parallelism, pipelining, and concurrency. 

Data Parallelism

As in traditional supercomputing, data parallelism [47] refers to the distribution of data across
different computational nodes. In this regard, FPGAs work best with datasets that exhibit only a few,
if any, data dependencies. Indeed, FPGAs are good at exploiting data parallelism because the same
operations can be performed on different data. As a consequence, the FPGA design will require a
significantly smaller number of logic blocks and interconnect resources. With data parallelism, the
same logic blocks can be used over and over again until all the data has been processed. Also, the
operations can be reordered to improve computational performance. Furthermore, computations
exploiting data parallelism can be scheduled to maximize data reuse, to increase computational
performance, and to minimize memory bandwidth.

Pipelining 

As in traditional supercomputing, pipelining is related to being able to overlap computations. 
In this case, overlap usually involves shifting operations in time so that different pieces of hardware 
are working on different stages of the same task. Pipelining leads to an effective level of parallelism 
in the implementation. Data parallelism cannot always be fully exploited due to I/O limitations,
whereas pipelining permits parallel processing without increasing I/O bandwidth.

In essence, pipelining is particularly important in reconfigurable computing because the oper- 
ations carried out in the FPGAs are often limited by delays in the interconnect. Therefore, pipelining 
is essential to increase parallelism and reusability of the reconfigured hardware at the expense of some 
latency (the amount of time it takes for a block of data to be processed). As a consequence, target 
applications that are able to tolerate latency are suitable candidates for FPGA acceleration. 
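
The trade between latency and throughput can be quantified with a small cycle-count model, sketched below in C. It assumes an idealized pipeline in which every stage takes one clock cycle, a new data item enters on every cycle, and no stalls occur; real designs will deviate from this. The function names are illustrative.

```c
/* Cycles needed when each item must pass through all stages before the
   next item may start (no overlap between items). */
unsigned long unpipelined_cycles(unsigned long stages, unsigned long items)
{
    return stages * items;
}

/* Cycles needed when items overlap in an ideal pipeline: the first item
   takes `stages` cycles (the latency), and every later item completes
   one cycle after its predecessor. */
unsigned long pipelined_cycles(unsigned long stages, unsigned long items)
{
    if (items == 0)
        return 0;
    return stages + (items - 1);
}
```

For a five-stage pipeline processing 1000 items, the model gives 5000 cycles without overlap and 1004 cycles with it. The latency of an individual item is still five cycles, which is why applications that tolerate latency benefit most.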


Concurrency 

As in traditional supercomputing, concurrency is the ability of certain codes to perform several
different computations simultaneously by different computational units. Thus, if we have enough
data parallelism in the dataset, then we can apply a large number of logical operations at the same
time. In other words, we exploit task parallelism. Even though FPGAs are substantially slower than
ASIC CPUs, their advantage resides in the amount of instruction level parallelism that they offer.
Then, in order to obtain increased computational performance, the FPGA needs to complete at
least 50 to 100 times more instructions per cycle than a regular CPU.

In addition to high levels of data parallelism, pipelining, and concurrency, the target application must
conform to a series of algorithmic considerations for maximum acceleration. These considerations 
involve data element size, arithmetic complexity, and control structures. 


Data Element Size 
The size of the data element that is processed by the FPGA determines to a great degree 
the speed and size of the reconfigured circuit. This statement is true regardless of the specific data
type. Indeed, if the application offers a high degree of data parallelism, then the computational
performance will depend on how many operations can be performed concurrently. Larger data
elements lead to larger circuits, leaving room for fewer computational elements in the FPGA and thus less
parallelism. Therefore, one of the goals of high performance reconfigurable supercomputing is to
look for an efficient data representation that uses the fewest possible number of bits.


Arithmetic Complexity 

Similarly, as with large data elements, complex arithmetic operations require larger circuits, 
reducing the number of processing elements available and impacting the degree of parallelism. In 
order to obtain the highest performance, it is desirable to use the simplest operations possible. 
Consequently, reconfigurable computing will often imply a precision/performance trade-off. 


Control Structures 

In general, FPGAs have much better computational performance if most of the logic oper- 
ations can be statically scheduled. Indeed, from the context of a computer program, it takes time 
and resources to make decisions. Furthermore, the conditional if statement requires sophisticated 
circuitry that has to be implemented by the logic blocks and interconnect in the FPGA. Thus, static 
control structures may significantly speed up the computation time and require a considerably smaller 
amount of logic resources. In this regard, datasets that exhibit few dependencies often need only
simple control structures.
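
One common way to remove a data-dependent branch is to compute both alternatives and choose between them with a fixed selection step, which corresponds naturally to a multiplexer in hardware. The C sketch below shows the idea with a bit mask. It is only an illustration of the principle; how a particular FPGA tool maps such code onto logic is not specified here, and the function names are hypothetical.

```c
#include <stdint.h>

/* Branch-free select: returns `a` when `cond` is nonzero, `b` otherwise.
   The mask is all ones or all zeros, so the same operations are executed
   for every input, which keeps the schedule static. */
uint32_t select_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}

/* Example use: clamp a value to an upper limit without an if statement. */
uint32_t clamp_max(uint32_t x, uint32_t limit)
{
    return select_u32(x > limit, limit, x);
}
```

Both inputs of the select are always available, so the amount of work per data element does not depend on the data. This regularity is what allows operations to be scheduled statically.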


3.4 I/O CONSIDERATIONS 


As previously argued, the power of reconfigurable supercomputing using FPGAs resides in the
exploitation of data parallelism, pipelining, and concurrency. However, in order to fully exploit
concurrency and pipelining, data must be transmitted at high enough rates to keep the hardware
busy.

If there are more I/O than compute operations, concurrency and pipelining will only have a
small effect on the overall performance of the reconfigurable application. In such a case, increased
performance will require higher memory bandwidth, which may or may not be available. On the
other hand, if there are more compute than I/O operations, then concurrency and pipelining will be
able to accelerate the application by a considerable amount.

Optimal FPGA acceleration takes place by carefully scheduling data transfers and computa- 
tion in order to achieve substantial levels of concurrency and pipelining. Data elements that have 
been fetched from memory should be reused multiple times. In other words, it is desirable to keep 
many processing units busy while using the exact same data. 

As a consequence, limited I/O bandwidth in the FPGA device imposes serious restrictions 
on the type of algorithms that can be accelerated using reconfigurable computers. If an algorithm 
involves considerable I/O (it is an I/O bound algorithm), then it may not be a good candidate for 
FPGA acceleration. 

On the other hand, computation bound algorithms are very good candidates for FPGA
acceleration. Of course, the type of operations being performed limits the degree of improved performance.
Regardless, the potential speedup depends on the exploitation of data parallelism, pipelining, and
concurrency.
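
A rough way to classify a kernel as I/O bound or computation bound is to compare the time needed to move its data with the time needed to process it. The C sketch below does so under the idealized assumption that transfers and computation overlap perfectly, so that the slower activity determines the run time. The function names and the model itself are illustrative assumptions, not measurements of any system.

```c
#include <assert.h>

/* Time to stream `bytes` of data at `bytes_per_second`. */
double io_seconds(double bytes, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second;
}

/* Time to perform `ops` operations at `ops_per_second`. */
double compute_seconds(double ops, double ops_per_second)
{
    assert(ops_per_second > 0.0);
    return ops / ops_per_second;
}

/* Returns 1 when data movement takes longer than computation. */
int is_io_bound(double io_s, double compute_s)
{
    return io_s > compute_s;
}

/* Idealized run time when I/O and computation overlap completely:
   the slower of the two activities sets the pace. */
double overlapped_seconds(double io_s, double compute_s)
{
    return io_s > compute_s ? io_s : compute_s;
}
```

In this model, adding parallel processing elements shortens only the compute term. Once the kernel is I/O bound, further parallelism leaves the overlapped run time unchanged, which is the limitation described above.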

Any Boolean function can be mapped to an FPGA. The computation of such a Boolean function will
be more efficient in the FPGA if we are able to exploit data parallelism, pipelining, and concurrency.
If an algorithm invokes a computational kernel that repeats itself multiple times inside a loop, then
the kernel could be programmed inside the FPGA. In that case, the ASIC CPU executes the program
and only invokes the FPGA to execute the compute-intensive loop.

Programs that are good candidates for FPGA acceleration include those for which a single
loop is performing a large number of simple operations. In these codes, the loop dominates the
overall execution time of the algorithm. In this case, the loop and its kernel are the targets for FPGA
acceleration, instead of the entire code. Furthermore, by Amdahl's law, the acceleration of this single
loop will lead to a significant increase in the overall performance of the entire code.

Conversely, if the target code is rather complex, featuring dozens of time consuming loops and
other bottlenecks, then by Amdahl's law the FPGA acceleration of a few loops will not provide any
significant gains in computational performance.
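
Amdahl's law can be evaluated directly. In the C sketch below, `fraction` is the share of the original run time spent in the accelerated loop and `kernel_speedup` is the factor by which the FPGA speeds up that loop alone. The function name is illustrative.

```c
#include <assert.h>

/* Overall speedup predicted by Amdahl's law when a fraction of the run
   time (between 0 and 1) is accelerated by `kernel_speedup` and the
   remainder is left unchanged. */
double amdahl_speedup(double fraction, double kernel_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup);
}
```

If a single loop accounts for 95% of the run time and the FPGA accelerates it tenfold, the overall speedup is about 6.9. If the accelerated loops account for only 30% of the run time, the same tenfold kernel speedup yields an overall speedup of only about 1.37.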

Even if a single loop dominates the execution time of the target program, further algorithmic 
considerations are necessary before considering the use of FPGAs to accelerate the code. In general, 
the structure and type of operations inside the loop are dramatically restricted. Issues involved in 
the analysis of the loop include: 


* Avoid pointer operations used to handle common data structures such as linked lists. Such data
structures are inconsistent with parallel access and efficient use of local memory. In general,
an implementation must be rewritten, if necessary, to handle only values instead of pointers.


* Avoid double precision computations, as well as sophisticated trigonometric, logarithmic, and
transcendental operations. The reason is that these operations require a considerable area of
the chip.


* Avoid complex data dependencies. Recall the goal of FPGA acceleration is to take advantage
of the potential parallelism in the implementation. Hence, the body of the loop must be easily
parallelizable. Complex data dependencies typically require some of the configured blocks of
hardware to remain idle until data becomes available.


* The number of iterations should be known at compile time. Indeed, runtime control requires
sophisticated logic structures that demand large amounts of logic blocks and interconnects.


* Data blocks of arrays accessed in the body of a loop should not be very big. FPGAs execute
more efficiently by applying operations to a relatively small number of data elements. In such
a manner, an FPGA is able to reuse large portions of logic blocks and interconnects. It may be
possible to rearrange the computations to permit such data accesses. Important considerations
include:


— I/O requirements. For instance, searching a large database using FPGAs is extremely
inefficient unless the host processor and coprocessor enjoy shared memory.


— Depth of nested loops. The implementation of four or more nested loops will consume a
large amount of logic blocks and interconnect. Also, deeply nested loops will negatively
affect latency.


— Number of conditional operations. The exact same operations should be performed on all
of the elements inside a loop. Conditional statements inside a loop are perfectly correct
and allowed by the model. However, conditionals will consume valuable resources and
a large number of conditional statements may become a limiting factor for successful
acceleration.


A clear pattern emerges from these loop restrictions: In order to achieve considerable com- 
putational gains, the loops should be easily expressed in an algorithm that exploits data parallelism, 
pipelining and concurrency. 
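
As an illustration of the first restriction in the list above (values instead of pointers), the following C sketch copies the values of a linked list into a plain array on the host, so that the loop to be accelerated has a fixed trip count and touches memory in a regular pattern. The structure and function names are hypothetical, and the example is plain host-side C rather than FPGA code.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Copy at most `capacity` values from the list into `out` and return the
   number of values copied. This pointer-chasing step stays on the host. */
size_t flatten_list(const struct node *head, int *out, size_t capacity)
{
    size_t n = 0;
    while (head != NULL && n < capacity) {
        out[n++] = head->value;
        head = head->next;
    }
    return n;
}

/* A loop over a flat array: no pointers are followed, and the same
   operation is applied to every element. */
long sum_values(const int *values, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += values[i];
    return total;
}
```

The second function is the kind of loop the restrictions favor: the data are contiguous values, the trip count is known before the loop starts, and there are no conditionals in the body.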


3.6 EFFECTIVE RECONFIGURABLE ALGORITHM DESIGN 


At this point, it should be clear that FPGAs and CPUs offer distinct capabilities, restrictions,
advantages, and disadvantages. Therefore, an algorithm that may be a poor choice for a traditional
CPU may actually be an excellent choice to be implemented on an FPGA, and vice versa.

For example, consider an application that merely organizes data into a histogram. That is, 
given a dataset, we gather data elements into specific groups according to their value. In traditional 
C programming, for instance, an optimal implementation does not involve the brute force approach 
of exhaustive search. Instead, it is recommended to implement a variant of binary search. On the 
other hand, an FPGA implementation based on the brute force method is practical as its inherent 
data parallelism, concurrency, and pipelining attributes can be directly exploited. 
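
The two approaches to histogram binning can be contrasted in C as follows. The bin layout (an ascending array of `nbins + 1` edges, with at least one bin) and the function names are assumptions made for this sketch. In the brute force version every comparison is independent of the others, which is the property an FPGA could exploit by evaluating all of them at once; on a CPU the same loop simply runs sequentially.

```c
/* Bins are defined by nbins + 1 ascending edges: a value v falls in bin k
   when edges[k] <= v < edges[k + 1]. Both functions return -1 when v lies
   outside the range covered by the edges. */

/* CPU-oriented approach: binary search over the edges. Each step depends
   on the outcome of the previous comparison. */
int bin_binary_search(const int *edges, int nbins, int v)
{
    int lo = 0, hi = nbins;
    if (v < edges[0] || v >= edges[nbins])
        return -1;
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (v >= edges[mid])
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Brute force approach: test v against every bin. The tests do not depend
   on one another, so they could all be evaluated at the same time. */
int bin_brute_force(const int *edges, int nbins, int v)
{
    int bin = -1;
    for (int k = 0; k < nbins; k++) {
        int hit = (v >= edges[k]) && (v < edges[k + 1]);
        if (hit)
            bin = k;
    }
    return bin;
}
```

Both functions return the same bin index for every input. They differ in the structure of the work: a chain of dependent decisions in the first case, and a set of independent comparisons in the second.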

Let us consider another example. Suppose the following code segment is the kernel of a 
program that we wish to accelerate using FPGAs. 


for (i=0; i<n; i++)
{
    a[i] = 0;
    for (j=0; j<m; j++)
        a[i] = a[i] + K[i][j]*f(a[i]);
}


Is this computational kernel a good candidate for FPGA acceleration? Following our previous
discussion, we can conclude that, if n and m are known at compile time, a and K are arrays of integers
or single precision numbers, and f is a function that does not involve trigonometric or logarithmic
operations, then this loop appears to be a good candidate for FPGA acceleration. Various strategies
are possible to implement this nested loop in an FPGA; we discuss some of them next.


Sequential 

All the computations are performed inside a single processing element. This strategy does not
exploit any data parallelism, instruction concurrency, or pipelining. As FPGAs are
considerably slower than CPUs, and due to the large data transfer overhead involved, this strategy
will perform much worse than the original CPU implementation.


Parallel 

This strategy performs the loop computations using many processing elements. That is, the 
reconfigurable circuit that computes the body of the loop is replicated several times across the entire 
area of the FPGA. Using a completely parallel strategy, all O(n × m) loop operations are performed
simultaneously, in a single computational step. This strategy clearly exploits the available degree of
data parallelism and instruction concurrency. However, it may require a large number of logic blocks
and interconnect resources, which may not be available.


Partly Parallel 

This strategy uses O(n) processing elements to compute the body of the innermost loop. 
Therefore, the entire code is completed in about O(m) computational time steps. This strategy is the 
best of the three choices because it reuses hardware with good performance. 
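
The partly parallel strategy can be pictured as a loop interchange, sketched below in C. The sketch assumes that the kernel update has the form a[i] = a[i] + K[i][j]*f(a[i]), uses small fixed sizes N and M, and substitutes a trivial placeholder for f; all three are assumptions made for illustration. On a CPU both versions run sequentially, but in the second version the inner loop over i consists of independent updates, one per processing element, while the outer loop over j represents the sequential time steps.

```c
#define N 4   /* number of rows, fixed at compile time    */
#define M 3   /* number of columns, fixed at compile time */

/* Placeholder for the function f; any cheap integer function would do. */
static int f(int x)
{
    return x + 1;
}

/* Original loop order: i outer, j inner. */
void kernel_sequential(int a[N], int K[N][M])
{
    for (int i = 0; i < N; i++) {
        a[i] = 0;
        for (int j = 0; j < M; j++)
            a[i] = a[i] + K[i][j] * f(a[i]);
    }
}

/* Partly parallel order: j outer, i inner. Each a[i] depends only on its
   own previous value, so the N updates within one step are independent. */
void kernel_partly_parallel(int a[N], int K[N][M])
{
    for (int i = 0; i < N; i++)
        a[i] = 0;
    for (int j = 0; j < M; j++)        /* M sequential time steps       */
        for (int i = 0; i < N; i++)    /* N independent updates per step */
            a[i] = a[i] + K[i][j] * f(a[i]);
}
```

Because each a[i] is updated in the same order of j in both versions, the results are identical. Only the schedule changes, which is what lets the same set of processing elements be reused at every time step.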


3.7 SUMMARY


To summarize this chapter, we provide a simple heuristic methodology to determine if a given code
is a good candidate for the FPGA: 


1. Perform benchmarking analysis to determine if the code is computation or I/O bound. If it is 
I/O bound, then it is probably not a good candidate for FPGA acceleration. 


2. Perform benchmarking analysis to determine any possible bottlenecks in the code. If most of 
the time is spent in a single loop, then this loop is a good candidate for the FPGA. If there are 
many loops consuming most of the computational time, by Amdahl's law, it is unlikely that 
we will observe significant improvements using an FPGA implementation. 


3. Determine if the target loop exhibits sufficient opportunities to exploit data parallelism, con- 
currency, and pipelining. 


4. Perform algorithmic analysis to address all the algorithmic considerations described in this 
chapter, such as parallelism, loop structure, I/O, etc. 


CHAPTER 4


FPGA Programming Languages 


One of the most challenging aspects of reconfigurable computing is the actual software development 
for a particular FPGA variant. FPGAs natively require Hardware Description Languages (HDL) 
such as VHDL or Verilog. VHDL and Verilog were specifically developed for hardware design.
As a consequence, these languages require enormous effort to develop software applications based 
on either existing implementations or algorithms. Thus, VHDL and Verilog are exceptionally good 
languages for the description and design of logical circuits, but they are extremely inefficient for 
reconfigurable computing of scientific problems. 

Such deficiency is made evident by surveying the actual literature on VHDL and Verilog 
programming. Most of the published books on the topic consider examples geared to implementing 
digital devices such as multiplexers and decoders. But there is no substantial discussion or examples 
illustrating how to use these languages to perform the programming of a search or sorting algorithm. 
As such, it becomes a challenge for the software engineer to develop computational applications using 
VHDL or Verilog. 

In recent years, a handful of FPGA programming tools and languages have appeared in the
marketplace. These products aim to ease the FPGA programming effort by using syntax and
semantics similar to those of traditional programming languages like C and Java. A few others employ the
user-friendly interfaces of well-known computational products such as Matlab. The compilation of these
programs produces a VHDL file that can be used to synthesize a bitstream using the FPGA vendor's
bitstream synthesis tools.

In this category, some of the most popular FPGA programming tools and languages include 
DSPLogic [48], Handel-C [25], Impulse-C [49], and Mitrion-C [21]. Each of these languages 
presents its own specific set of advantages and disadvantages. In general, the common advantage 
these products offer is that their syntax and semantic structure is better suited for algorithmic 
design than for hardware description. Handel-C, for instance, has the advantage that some of its 
semantics and syntax are identical to traditional C. On the other hand, DSPLogic is based on Matlab 
and Simulink, and uses a very simple graphical user interface. Matlab and other packages such as 
Mathematica, Star-P, and Python are considered Very High Level Languages for HPC [41]. 

Nevertheless, these FPGA tools present a variety of obstacles. Each tool requires a programmer
to learn a new paradigm. Although an experienced programmer can learn a traditional language in a
couple of weeks, it takes much longer to understand a new paradigm as well as the syntax and semantics.
In particular, good parallel programming is challenging by itself.

Furthermore, reconfigurable acceleration of a code with a limited tool kit is a formidable task 
even for an advanced FPGA programmer. In addition, because these languages basically describe the 
functionality of the FPGA logic blocks and interconnects, they are severely restricted in the types
of things that they can do. These tools are less efficient and less flexible when compared to a low 
level HDL design. Notwithstanding, we expect compilers will increasingly map higher-order codes 
more efficiently onto FPGA devices. 

Regardless of the languages or tools used, the programming of FPGAs is a challenging and 
time-consuming task. In this regard, the principal problems confronted by the software developer 
include a difficult debugging process and a nearly total lack of portability. Efficient debugging, 
for instance, is extremely difficult to accomplish within the realm of reconfigurable computing. The 
principal reason for this limitation is the inability to easily print the contents of variables in the FPGA
code without increasing the logic and changing the timings, i.e., without changing the program. In 
addition, the bitstream synthesis is a process that typically takes several hours or more. Hence, the 
traditional debugging strategy of using multiple print statements and frequent recompiling is not 
feasible when programming FPGAs. 

Within the context of reconfigurable computing, the only way to perform debugging is to use 
a simulator provided by the vendor. These simulators mimic the functionality of the FPGA under 
certain inputs and provide a reasonable way to determine when an error has occurred. The usefulness 
of an FPGA programming language is limited by the functionality that the associated simulator provides. 

Further complicating the FPGA programming process is the lack of code portability across 
different reconfigurable supercomputers. For example, the SGI and Cray reconfigurable supercom- 
puters require completely different host and FPGA code to account for their differences in terms of 
architecture and available FPGA devices. 

Furthermore, there is no portability across different FPGA programming languages. For in- 
stance, Mitrion-C code is completely different from DSPLogic, Handel-C, Impulse-C, and VHDL. 
In addition, the specific interface to the host CPU is also completely different for all of these FPGA 
programming languages. 

In the remainder of this chapter, we will closely examine four popular FPGA programming 
languages: VHDL, DSPLogic, Handel-C, and Mitrion-C. Other languages such as Verilog, 
Impulse-C, and Chimps are covered elsewhere. 


4.1 VHDL 


Register Transfer Level (RTL) tools provide a description of a digital design using logical expressions. 
Thus, RTL describes behavior of a digital circuit in terms of signals between registers and the logical 
operations performed on them. In this regard logical expressions, such as AND or XOR, form a 
higher level of abstraction above gate description. RTL tools allow the full control of register-to- 
register logic without having to select the specific gates that implement the design. 

As a consequence, RTL tools are helpful to create a high level representation of a circuit 
for which a low level gate and circuit representation can be easily derived. In particular, Hardware 
Description Languages (HDLs) produce an RTL abstraction of digital circuits. HDLs focus on the 
description of how signals flow between different registers and how they change under certain logic 
operations. Examples of HDLs include VHDL and Verilog. 

VHDL stands for VHSIC Hardware Description Language where VHSIC is Very High 
Speed Integrated Circuits. VHSIC was a project sponsored by the US Department of Defense 
during the 1980s to accelerate advances in digital circuit technology. VHDL, as an RTL hard- 
ware description language, is a product of the VHSIC effort. Thus, VHDL was conceived as a 
tool to document the behavior of ASIC processors. VHDL offers a superior alternative to layout, 
focuses on the functionality of the device, and is completely independent of details specific to the 
implementation [23, 24]. 

In particular, one can design digital hardware in VHDL [7]. The description of the design 
in VHDL is subsequently used to produce the RTL schematic of the digital circuit. In addition, 
VHDL allows the digital design to be modeled and verified before the synthesis tools translate the 
design into a specific implementation using wires and gates. 

Some of the features that distinguish VHDL include the following: 


* Strongly typed. 
* Based on Ada. 
* Features several similarities with object oriented languages such as C++ and Java. 


* Contains special constructs and semantics to describe instruction concurrency at the hardware 
level. 


* Follows IEEE standards. 
* Periodically revised to reflect the latest technological and industry trends. 


Some of the deficiencies of VHDL as a programming language for FPGAs, within the context 
of reconfigurable supercomputing, include the following: 


* The user is required to learn and master VHDL, a time consuming effort. 
* VHDL is algorithmically deficient. 
* VHDL requires several lines of code to describe very simple arithmetic expressions. 


* The functionality of the FPGA is described at a very low level of abstraction, which may not 
be of interest to the developer of reconfigurable computing codes. 


* VHDL as a language to program FPGAs inside reconfigurable supercomputers lacks porta- 
bility. Indeed, the VHDL code strongly depends on the technological implementation of the 
specific FPGA. 


* VHDL is extremely difficult and time consuming to benchmark and optimize. 
* VHDL code is extremely difficult to debug. 


VHDL is a language that was developed especially for digital circuit design and not for 
algorithmic programming. This observation makes it clear why VHDL is mainly used by engineers 
to handcraft special implementations and not by programmers accelerating codes within the realm of 
reconfigurable supercomputing. Unless the user is already an expert VHDL programmer, this 
language is extremely difficult to use for the development of algorithmic applications, and it is not 
recommended as a language to program FPGAs. 

Verilog has capabilities similar to those of VHDL. Perhaps the most notable difference is that Verilog 
is not as strongly typed as VHDL. And truth be told, supporters of both languages regularly sustain 
heated debates concerning the superiority of one over the other. In any event, for the purposes of 
reconfigurable supercomputing, Verilog and VHDL are equivalent in the sense that they both share 
the same type of shortcomings for software development. 


4.2 DSPLOGIC 


DSPLogic produces a “Rapid Reconfigurable Computing Toolbox” aimed at graphical programming 
for the fast development of computational applications on FPGAs and standard CPUs. In 
strong contrast with Mitrion-C and Handel-C, DSPLogic does not offer a new computational lan- 
guage with syntactic and semantic similarities to C, but presents, instead, a sophisticated graphical 
interface where a computational program is built by interconnecting a variety of modules. 

The DSPLogic system is based on Matlab and Simulink, two popular software packages 
for mathematical analysis and simulation. Matlab provides a powerful numerical computing envi- 
ronment harnessed by an interpreted, high-level programming and scripting language. The three 
obvious strengths of Matlab are its implementation of linear algebra, Fourier analysis, and numerical 
algorithms. Thus, Matlab is widely used for a variety of applications in the areas of optics, radar, 
acoustics, and signal analysis. 

Simulink is a graphical language that wraps Matlab into a sophisticated graphical block 
diagramming tool. A large number of computational modules are included in the block library of 
Simulink, which also permits the user to build, customize, and store new modules. In Simulink, algorithms 
are implemented by dragging library blocks into a visual programming workspace and establishing 
connections between them. Simulink also adds a time flow variable, which permits the simulation of 
the temporal evolution of the system. By relying on the computational features of Matlab, Simulink 
is a rather powerful tool for modeling, simulation, and analysis of multidomain dynamical systems. 
Simulink is mostly used in control theory and digital signal analysis. 

DSPLogic's RC Toolbox is fully integrated with Matlab and Simulink, inheriting both 
their advantages and their disadvantages. Thus, FPGA programming using this tool must be done using 
Simulink’s graphical environment, with the benefit that all FPGA implementation operations are 
hidden in layers of abstraction inside library modules. The library includes modules that in principle 
allow access to Xilinx primitives and VHDL code. The program can be designed to run sequentially, 
pipelined, or in parallel. 

Because of its tight integration with Simulink, DSPLogic RC offers very efficient and reliable 
simulation and debugging tools. However, DSPLogic works only on a Windows-based PC. The user 
has to design, program, and synthesize the FPGA code on the PC and then move the resulting 
bitstream file to the XD1. 

The data streaming and communications between the host processor and the FPGA are 
handled by DSPLogic’s messaging abstraction layer, a wrapper around the Cray XD1 API library 
functions. The host program invokes a series of library calls to establish the connection, configure 
the device and communicate with the FPGA. 

Overall, DSPLogic's RC diagrammatic programming appears to be better suited for digital 
circuit design although it has been used for bioinformatics, cryptography, and signal processing. The 
block abstraction and code encapsulation features inherited from Simulink may prove valuable for 
the design of very large and complex reconfigurable codes. However, even simple reconfigurable 
codes, such as the addition of two multidimensional vectors, often require dozens of interconnecting 
blocks. In addition, based on the Simulink graphical programming paradigm, the user needs to learn 
Matlab and Simulink. 


4.3 HANDEL-C 


Handel-C is a unique Hardware Design Language (HDL) that was originally developed by the 
Oxford Hardware Compilation Group and was later developed into a proprietary product by Celoxica 
(acquired by Catalytics, which was merged to form Agility, whose C synthesis assets were eventually 
acquired by Mentor Graphics in 2009). The product is no longer supported. In any case, Handel-C is 
well established compared to the relatively new and novel language Mitrion-C (covered in the 
next section). Handel-C may be viewed as an extension of a large subset of standard C, including 
its precedence rules for operators and syntax for control structures (for, if, switch, while), 
functions and expressions, operators, and data structures (arrays, structures, pointers, variables) [25]. 
The name honors the classical music composer George Frideric Handel, with the “-C” indicating 
its relationship to the C language. 

Handel-C code is intended to be portable excluding the parts dealing with low-level hardware 
interfaces (pins) and customized interfaces to software modules written in Handel-C, VHDL or 
Verilog (ports). Yet, simply redefining the interfaces for another system is not likely enough to obtain 
a program that runs satisfactorily. Indeed, writing an efficient program normally begins with a careful 
design of the interfaces to effectively utilize the hardware. On the other hand, functions compiled 
into libraries can be reused on the same system or on the same platform with identical I/O ports 
and memory subsystems. 

The language provides channels for point-to-point communication between processes exe- 
cuting in parallel. Channels are familiar from the occam programming language that is based on 
Communicating Sequential Processes (CSP). One branch outputs data onto a channel and another 
branch reads data from the same channel. If there are multiple channels, reads and writes are listed 
according to a desired priority to select the first channel that is ready (prialt). If communication is 
not synchronized, then a FIFO queue is employed so that writes are blocked only when the queue 
is full, and reads are blocked only when the queue is empty. 
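
The blocking rule for unsynchronized channels can be modeled in plain C as a bounded ring buffer. This is only a software sketch, not Handel-C code: the names fifo_t, fifo_try_write, and fifo_try_read and the depth of 4 are assumptions made for illustration. A return value of 0 marks the point at which the corresponding hardware process would stall. 

```c
#include <stddef.h>

#define FIFO_DEPTH 4  /* hypothetical queue depth */

typedef struct {
    int    data[FIFO_DEPTH];
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of elements currently queued */
} fifo_t;

/* Returns 1 on success, 0 if the queue is full (the writer would block). */
static int fifo_try_write(fifo_t *q, int value)
{
    if (q->count == FIFO_DEPTH)
        return 0;
    q->data[(q->head + q->count) % FIFO_DEPTH] = value;
    q->count++;
    return 1;
}

/* Returns 1 on success, 0 if the queue is empty (the reader would block). */
static int fifo_try_read(fifo_t *q, int *value)
{
    if (q->count == 0)
        return 0;
    *value = q->data[q->head];
    q->head = (q->head + 1) % FIFO_DEPTH;
    q->count--;
    return 1;
}
```

A zero-initialized fifo_t is an empty queue. Values are read back in the order in which they were written. 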

The language also provides semaphores to allow the coordination of shared resources. A 
semaphore should be held only as long as needed (trysema and releasesema) in the "critical sec- 
tion" such as an update of a variable that is shared. Priority is undefined (as its implementation is 
proprietary) when several processes concurrently attempt to take a semaphore to create a lock. 

Each assignment, increment, decrement, channel communication, delay, and release of a 
semaphore takes one (clock) cycle. Functions, expressions, conditions and jumps (break, goto, 
continue, return) are evaluated in zero cycles. Signals are objects that behave like wires: the value 
read during a cycle is the value assigned during that same cycle. 

Loops require at least one cycle; whence, empty loops are not permitted. A conditional test (if 
block) before a loop is inserted whenever there is a possibility the body is not executed at least once. 
Whenever inserting an if block, an else block is included with a delay statement. Similarly, a delay 
is added in a default statement of a switch construct. The compiler may also add delay statements 
to break loops in order to avoid timing conflicts. 

In for loops, the assignment of the loop counter after the execution of the loop body requires an extra 
cycle. While loops are preferable as the loop counter is updated in parallel with other statements 
in the loop body. This optimization is significant whenever the entire loop is executed many times 
(such as in a nested loop). 
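
The difference can be estimated with a simple cycle-count model, written here in C. The model is a sketch under stated assumptions: initializing the counter is an assignment and costs one cycle, the loop condition is evaluated in zero cycles, and in the while version the counter update is placed in a par block together with the loop body. The function names are hypothetical. 

```c
/* Cycle-count model; body_cycles is the cost of one pass through the loop body. */

/* for (i = 0; i < n; i++) { body }: one cycle to initialize i,
   then body_cycles plus one extra cycle for i++ on every iteration. */
static unsigned long for_loop_cycles(unsigned long n, unsigned long body_cycles)
{
    return 1 + n * (body_cycles + 1);
}

/* i = 0; while (i < n) { par { body; i++; } }: one cycle to initialize i,
   then the counter update overlaps the body on every iteration. */
static unsigned long while_loop_cycles(unsigned long n, unsigned long body_cycles)
{
    unsigned long per_iter = body_cycles > 1 ? body_cycles : 1;
    return 1 + n * per_iter;
}
```

Under these assumptions, a one-cycle body executed 1000 times costs 2001 cycles in the for loop and 1001 cycles in the while loop, so the saving approaches a factor of two for short loop bodies. 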

At most one "element" in an internal block or external ROM or RAM can be accessed in any 
clock cycle. ROMs permit only read accesses. By using “multi-ported” RAM (mpram), it is possible to 
simultaneously perform a read and a write, two reads, or two writes. The maximum number 
of accesses in mpram during a cycle depends on the actual number and type of ports available in 
hardware, e.g., a Virtex-4 FPGA has a read/write port and a read port. 

The clock period is determined by the longest expression (involving arithmetic and shifting 
operations and possibly signals) in a single statement, the deepest depth of nested control structures, 
or the longest chain of logical control blocks. The reason is that all of the operations or conditions 
must be executed in a single cycle, which means the delay can become significant. If the delay is too 
long, timing constraints may not be met. A programmer should avoid division, break up complex 
statements into several simpler statements (using registers instead of signals), reduce the bit-widths 
of variables, and avoid nesting control structures more than a few layers deep. If the code is carefully 
written using registers in a pipeline fashion, flip-flops can be automatically moved by the compiler. 

Execution of successive statements and blocks is sequential by default, except all cases of a 
switch statement (and all branches of an if block) are evaluated in parallel. Handel-C provides a 
par(allel) block to concurrently execute all statements and blocks within the par block. Care must 
be taken when grouping statements together in a par block because it is not possible to write to 
the same variable during the same cycle. Blocks may be explicitly modified as seq(uential) blocks 
(default case) as documented. Parallel and sequential blocks may be nested. Parallel blocks complete 
only when all statements and blocks within the par block complete, i.e., when the longest “branch” 
completes. 
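
The completion rule can be expressed as a small timing model in C: a seq block costs the sum of the cycle counts of its branches, whereas a par block costs the maximum. The function names are hypothetical, and the per-branch cycle counts are assumed to be known in advance. 

```c
#include <stddef.h>

/* seq block: branches run one after another, so their cycle counts add up. */
static unsigned seq_block_cycles(const unsigned *branch_cycles, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += branch_cycles[i];
    return total;
}

/* par block: all branches start together; the block ends with the longest one. */
static unsigned par_block_cycles(const unsigned *branch_cycles, size_t n)
{
    unsigned longest = 0;
    for (size_t i = 0; i < n; i++)
        if (branch_cycles[i] > longest)
            longest = branch_cycles[i];
    return longest;
}
```

For three branches that take 1, 4, and 2 cycles, the sequential block takes 7 cycles and the parallel block takes 4 cycles. 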

Handel-C also provides parallel (par) and sequential (seq) loops. The loop body is duplicated 
with appropriate indices depending upon the number of steps specified. In par and seq loops, the 
replicated bodies are executed in parallel and sequentially, respectively. 

Side effects are not permitted in expressions. In particular, the comma operator cannot be 
used to list expressions. Consider the following statement in C: 


a = (d--, --b - 1); 


This C statement may be written in Handel-C as 


par 
{ 
    a = b - 2; 
    b--; 
    d--; 
} 
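
The equivalence can be checked with a plain C model. The sketch assumes that the comma expression in the C statement is parenthesized, i.e., a = (d--, --b - 1), and that every statement in the par block reads the values held at the start of the clock cycle. The type regs_t and both function names are hypothetical. 

```c
typedef struct { int a, b, d; } regs_t;

/* Ordinary C: the comma operator evaluates d-- first, then --b - 1. */
static regs_t run_sequential_c(regs_t r)
{
    r.a = (r.d--, --r.b - 1);
    return r;
}

/* Model of the par block: every right-hand side reads the values held
   at the start of the clock cycle, and all registers update together. */
static regs_t run_par_block(regs_t r)
{
    regs_t next = r;
    next.a = r.b - 2;
    next.b = r.b - 1;
    next.d = r.d - 1;
    return next;
}
```

Both functions return the same register contents for any input. The C version needs an ordering of side effects, whereas the par version completes in a single cycle. 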


Variables are in effect registers. It is not safe to assume a variable is initialized to zero unless 
it is static. Only static and global variables can be initialized. Initialization reduces the amount of 
logic and clock cycles needed. 

The width of data types is not fixed (not restricted to 16-bit, 32-bit, or 64-bit, etc.). The width 
of a variable is either specified or automatically inferred. Avoid using longer widths than necessary 
to more efficiently utilize the hardware resources on the chip. 
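
A programmer can estimate the narrowest usable width from the largest value that a variable must hold. The following C helper is a minimal sketch for unsigned quantities; the name bits_needed is hypothetical. 

```c
/* Smallest number of bits that can hold every value in 0..max_value. */
static unsigned bits_needed(unsigned long max_value)
{
    unsigned bits = 1;  /* even the value 0 occupies one bit */
    while ((max_value >>= 1) != 0)
        bits++;
    return bits;
}
```

For example, a counter that runs from 0 to 123 needs only 7 bits, far fewer than the 32 bits of a conventional C int. 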

Unlike in standard C, a Handel-C program may consist of multiple main() functions, each 
of which is associated with a clock so that different parts of the program run at different speeds. 
However, all main() functions in the same source file must use the same clock. Unlike other 
functions, a main() function neither takes any parameters nor returns any values. 

At most a single function call may appear in and only in an expression statement. For example, 
f(x) is a valid expression; whereas, f(g(x)) and f(x) + g(x) are invalid expressions. Function 
parameters (including pointers) are copied from the calling function, i.e., references are not modified. 
The types of the parameters must be explicitly declared in the function definition. 

Functions may be either inlined for parallel execution so that the hardware resources are 
not shared, or "shared" for sequential execution. Arrays of functions may be defined to make a fixed 
number of copies for parallel execution. Whenever a function is copied (whether in an inline function 
or an array of functions), all statically declared variables are independent among all of the copies. 
Recursive functions are not supported because all of the logic must be determined at compile time. 

The language provides extensive macro support to avoid rewriting tedious details. Each macro 
call involves unshared hardware resources. Macro expressions are useful to make copies of expressions 
(vs. statements) for parallel execution. Unlike macro expressions, a macro procedure, which is similar 
to an inline function, consists of complete statements and does not return a value. 

Unlike functions, macros permit call-by-reference untyped parameters so that the actual references 
are used instead of a copy. For instance, a macro procedure swap(x, y) swaps the values of the 
actual variables x and y listed in a macro call whereas a function call swap(x, y) cannot modify the 
references (only copies of x and y). Non-parameterized macros are similar to preprocessor directives 
(#define). 
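
A similar contrast exists in standard C between a preprocessor macro and a function with value parameters. The sketch below is C rather than Handel-C; the names SWAP_MACRO and swap_by_value are made up for illustration, and the macro fixes the type to int, whereas a Handel-C macro procedure is untyped. 

```c
/* Macro: textual substitution, so the caller's own variables are swapped. */
#define SWAP_MACRO(x, y) do { int tmp_ = (x); (x) = (y); (y) = tmp_; } while (0)

/* Function with value parameters: only the local copies are swapped. */
static void swap_by_value(int x, int y)
{
    int tmp = x;
    x = y;
    y = tmp;
}
```

After swap_by_value(a, b) the caller's a and b are unchanged; after SWAP_MACRO(a, b) they are exchanged. 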

Complex macros may be defined using “let” definitions of simpler macros. Recursive macro 
expressions and procedures are supported. A “select macro” works like the ternary operator (?:) 
in C except the condition is evaluated at compile time; whence, the hardware is not configured to 
perform the ternary operation. Similarly, an “ifselect macro” works like a logical if statement: Only 
the selected block of code is expanded at compile time. 

Unlike macro expressions, shared expressions provide opportunities to reuse hardware to 
reduce the amount of time that resources are idle. Shared expressions are useful when the same 
computations (involving different inputs) are performed at different times during program execution. 
Shared expressions are defined and called using the same syntax as macro expressions except for the 
declaration as a shared or macro expression. Although recursion is not supported directly, recursive 
shared expressions may be built up using recursive macro expressions. 

The language provides built-in base data types for characters, signed and unsigned integers. 
Derived types for data aggregates include arrays and structures. Arrays are like RAMs as a linear 
memory model. Arrays are needed for parallel access due to the restrictions for accesses to memory. 
Structures are useful for holding data like the mantissa and exponent of floating point numbers. 
Support for fixed point and floating point is provided in the form of libraries. In addition to familiar 
operators, the language provides special operators for bit manipulation, range selection (a[MSB:LSB]), 
and concatenation. 
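
In standard C, the effect of range selection and concatenation has to be spelled out with shifts and masks. The helpers below are a sketch for 32-bit unsigned values; the names bit_range and bit_concat are hypothetical and are not part of any Handel-C library. 

```c
#include <stdint.h>

/* Equivalent of a[msb:lsb]: extract bits msb..lsb (inclusive) of a.
   Requires 31 >= msb >= lsb. */
static uint32_t bit_range(uint32_t a, unsigned msb, unsigned lsb)
{
    unsigned width = msb - lsb + 1;
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (a >> lsb) & mask;
}

/* Concatenation: place hi above the low lo_width bits of lo.
   Requires 0 < lo_width < 32 and a result that fits in 32 bits. */
static uint32_t bit_concat(uint32_t hi, uint32_t lo, unsigned lo_width)
{
    return (hi << lo_width) | (lo & ((1u << lo_width) - 1u));
}
```

For instance, bit_range(0xABCD, 11, 4) yields 0xBC, and bit_concat(0xA, 0x5, 4) yields 0xA5. In hardware these operations amount to wiring and cost no logic, which is one reason a dedicated syntax is convenient. 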

Handel-C provides buses for interfaces with external devices using pins. User-defined interfaces 
are used for communication involving other codes when Handel-C is the top-level module in 
a design. Ports are used for communication between modules when Handel-C is not the top-level 
module in a design. 

Few examples exist of programs that target the Xilinx FPGAs and actually run on the Cray 
XD1. Unfortunately, those examples that do exist are trivial. 

A command-line compiler (handelc) runs under Linux to compile Handel-C code targeting a 
particular FPGA variant. For example, the following command targets a Xilinx Virtex II Pro FPGA 
on the Cray XD1: 


handelc -f XilinxVirtexIIPro -p xc2vp50-ff1152-6 -edif \ 
    -I "/usr/share/celoxica/PSL/Hardware/Include" \ 
    -I "/usr/share/celoxica/hardware/include" \ 
    -xl "/usr/share/celoxica/PSL/Hardware/Lib/cray_xd1.hcl" \ 
    -D NDEBUG -D __EDIF__ AddOne/main.hcc 


A version of the Design Kit (DK Design Suite) is available for the Cray XD1. The DK consists 
of an integrated development environment (IDE) to compile, debug and simulate, and provides a 
graphical user interface (GUI) in an xterm window (dkdev). To use the GUI tools, some setup is 
required. 

The Platform Developer's Kit (PDK) includes libraries, tools and code to enhance software 
development for different platforms [8, 26]. The XD1 Platform Support Library (PSL) is a collection 
of Handel-C interfaces to Cray cores (e.g., rt_core.vhd). Although not supported on the 
Cray XD1, the PDK was intended to support co-simulation with VHDL, Verilog, SystemC, and 
MATLAB designs and provides implementations of the PSL, Platform Abstraction Layer (PAL), 
and Data Stream Manager (DSM) components. PAL provides an application programming inter- 
face (API) to enhance portability. DSM provides a data transfer API to enhance communication 
between the processor (SMP) and the application acceleration processor (FPGA AAP). 

Unlike Mitrion-C, Handel-C does not require space on the chip for a virtual processor; 
whence, more hardware resources are available for the application. To run only a simulation, a 
license is required; whereas, a free simulator is available to run a simulation based on code developed 
using Mitrion-C. If the simulation is correct, then the appropriate Handel-C interfaces may be 
added to run on hardware. ModelSim and Riviera can be used for verification, but these tools 
are not available for the Cray XD1. However, a version of ModelSim that runs under a Microsoft 
Windows OS is available. 

The memory map must be defined in a Handel-C file. A programmer configures the FPGA 
address space following certain conventions for the Xilinx software tools and Cray API. Although 
these conventions do not depend on the programming language, the various software tools from the 
different vendors work successfully together because the same conventions are followed. 

On the Cray XD1, the FPGA address space occupies a 128 MB region of HyperTransport 
I/O address space. A programmer divides up this address space into memory ranges using non- 
overlapping byte addresses. The top 64 MB of FPGA address space may be used for registers. It is 
convenient to use preprocessor directives to configure the FPGA address space as shown below. 


#define MEG         ( 1024 * 1024 ) 
#define TOP         ( 128 * MEG ) 
#define REG_TOP     ( TOP - 1 )          // Top of 64 MB region for registers 
#define REG_BASE    ( 64 * MEG )         // 0x4000000 

#define QDR_TOP     ( REG_BASE - 1 )     // Top of 4 MB region for QDR II SRAM 
#define QDR_BASE    ( 60 * MEG )         // 0x3C00000 

#define UNUSED_TOP  ( QDR_BASE - 1 ) 
#define UNUSED_BASE ( 16 * 1024 )        // 0x4000 

#define BRAM_TOP    ( UNUSED_BASE - 1 )  // Top of 16 KB region for Block RAM 
#define BRAM_BASE   ( 0 * MEG )          // 0x0 

The bottom 64 MB may be mapped to the QDR II SRAMs. In particular, I/O mapped 
accesses allow a feature called write combining. In the source code, the addresses must be shifted 
three bits to the right to convert byte addresses to quadword (64-bit) addresses. 
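
The address arithmetic can be checked in ordinary C. The sketch below repeats the region boundaries given in the comments of the listing above, verifies at compile time that the ranges do not overlap, and converts a byte address to a quadword address. The function name byte_to_quadword is hypothetical, and _Static_assert assumes a C11 compiler. 

```c
#include <stdint.h>

#define MEG        (1024u * 1024u)
#define REG_BASE   (64u * MEG)         /* 0x4000000 */
#define QDR_BASE   (60u * MEG)         /* 0x3C00000 */
#define QDR_TOP    (REG_BASE - 1u)
#define BRAM_BASE  (0u * MEG)
#define BRAM_TOP   (16u * 1024u - 1u)  /* 0x3FFF */

/* Compile-time checks that the ranges are ordered and do not overlap (C11). */
_Static_assert(BRAM_BASE <= BRAM_TOP && BRAM_TOP < QDR_BASE, "BRAM overlaps QDR");
_Static_assert(QDR_BASE <= QDR_TOP && QDR_TOP < REG_BASE, "QDR overlaps registers");

/* Byte address -> quadword (64-bit) address: drop the three low bits. */
static uint32_t byte_to_quadword(uint32_t byte_addr)
{
    return byte_addr >> 3;
}
```

With these values, the byte address 0x3C00000 at the base of the QDR II SRAM region corresponds to the quadword address 0x780000. 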

In addition to the configuration of the FPGA address space, a Handel-C file normally includes 
calls using macro procedures defined in a PSL library. These macro procedures involve the low-level 
Handel-C interfaces to VHDL cores (hardware interfaces). On the Cray XD1, typical macro calls 
include 


RunRTIf() 
RunQDRIf() 
RunRTClient( APP_BASE, APP_TOP, AppRead, AppWrite ) 
RunQDR( QDR_BASE, QDR_TOP ) 
RunBRam( BRAM_BASE, BRAM_TOP ) 
RunReg( REG_BASE, REG_TOP ) 


RunRTIf() sets up the Rapid Transport (RT) core and waits for the RapidArray Processors 
(RAPs) to come up. RunRTClient() runs the RT core using the specified address space and macro 
procedures. Other calls use the RT client and specified macro procedures to use the QDRs, block RAMs 
and registers. 

A programmer must write the C code that runs on the Opteron symmetric multiprocessors 
(SMPs) using a consistent memory map defined in a Handel-C file, permitting communication 
between the processor and the coprocessor. Using the Cray API, it is possible to open, close, reset 
and start the FPGA device, and to read and write registers. 

As usual, the C code includes the appropriate header files and is compiled using the gcc 
command. 

After the compiling process is complete, the Xilinx software tools are used to create a bitstream 
that can be loaded into an FPGA and run. This synthesis process consists of placing and routing the 
logic on the FPGA, which is performed using a proprietary nondeterministic algorithm. Although 
compiling is fast, synthesizing can take many hours or days. 


4.3.1  CELOXICA EXAMPLE 

A Celoxica example (Appendix A) is based on a hyperspectral imaging project headed up by Charles 
Bachman of NRL's Remote Sensing Division [2]. The project takes multi-spectral images of features 
such as vegetation, soil and water. For each image, 124 wavelengths are collected. Due to light 
scattering, much nonlinearity exists in the data, making it difficult to provide an accurate analysis. A 
non-linear technique, the ISOMAP method [29], is based on the observation that the hyperspectral 
image points appear to lie on a lower dimensional manifold compared to the full 124 dimensional 
space. The ISOMAP method numerically determines a generalized coordinate system that describes 
the manifold and the colorization procedure follows from there. 

The ISOMAP method is easily applied to small images, but the littoral images usually contain 
tens of millions of pixels and the resulting calculation easily exceeds available computational time 
and memory limits of any available computer. The calculation involves the determination of pairwise 
and minimum path distances between all image points and has cubic scaling. Specialized search 
algorithms and data structures are used to overcome this computational hurdle. One of the data 
structures involves a search for intersections between points and point-neighborhoods. 

David Gardner (Celoxica) wrote an example as part of Celoxica software support for NRL, 
addressing the search for intersections. He changed the problem of finding intersections to one of finding 
intersections of points with four-dimensional hypercubes. The point coordinates as well as the hy- 
percube sizes and origins are picked at random using 64 hypercubes and 128K points. The calculation 
is repeated 500 times so that the wall clock times are on the order of 100 seconds for the Opteron 
version. 
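
The kernel of such a search can be sketched in C as follows. This is not the code of Appendix A; it is a brute-force illustration under assumptions of our own: integer coordinates, a hypercube described by its corner of smallest coordinates and a single edge length, and half-open intervals on every axis. All names are hypothetical. 

```c
#include <stddef.h>

#define DIM 4  /* four-dimensional hypercubes, as in the example */

typedef struct {
    int origin[DIM];  /* corner with the smallest coordinates */
    int size;         /* edge length */
} hypercube_t;

/* 1 if the point lies inside the cube (origin <= p < origin + size on every axis). */
static int point_in_hypercube(const int p[DIM], const hypercube_t *c)
{
    for (int k = 0; k < DIM; k++)
        if (p[k] < c->origin[k] || p[k] >= c->origin[k] + c->size)
            return 0;
    return 1;
}

/* Brute-force count of (point, cube) pairs that intersect. */
static size_t count_intersections(const int (*points)[DIM], size_t n_points,
                                  const hypercube_t *cubes, size_t n_cubes)
{
    size_t hits = 0;
    for (size_t i = 0; i < n_points; i++)
        for (size_t j = 0; j < n_cubes; j++)
            hits += (size_t)point_in_hypercube(points[i], &cubes[j]);
    return hits;
}
```

Every point-cube test is independent of the others and consists of a handful of narrow integer comparisons, which is the kind of work that maps well onto parallel FPGA logic. 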

The source code example, including the C code search.c and the Handel-C code 
search.hcc, appears in Appendix A. The search is carried out both on the FPGA processor and 
on the Opteron ASIC processor. The C code demonstrates how to set up the FTR region in order to 
transfer data from the Opteron to the FPGA using the Cray API. The Handel-C code uses the 
Celoxica macros to handle the data in the FTR region. 

The algorithm tested chooses hypercubes such that there is a 50% chance of intersection. 
The FPGA speed-up depends on this number: a lower percentage would benefit the software 
version, and a higher percentage would benefit the hardware version. Given this caveat, we observe a factor of 
113 speed-up on the FPGA compared to the Opteron. 


4.4 MITRION-C 


Mitrion-C is a novel high-level programming language for the Mitrion Virtual Processor (MVP) [20, 
21, 27]. This soft core may be viewed as a virtual machine for which special programming constructs 
are mapped to custom VHDL code for a variant of an FPGA chip. This virtual machine is a massively 
parallel computer that permits the type of fine-grain parallelism that characterizes FPGAs and 
implements speculative evaluation of if-statements. The MVP is not an abstraction and actually 
runs on an FPGA to carry out the computations prescribed in the Mitrion-C source code. 

Once a programmer learns the Mitrion-C programming paradigm, it is fairly easy to write 
various programs for practically any supported system and FPGA variant. Presumably, a programmer 
using a hardware description language (HDL) would spend considerable time writing implemen- 
tations for new FPGA variants that are being developed at a rapid rate. Yet, Mitrion-C code is 
hardly portable. The reason is that efficient code is designed to exploit the peculiar architecture. For 
example, a programmer needs to know how many memory banks are available, their bit-widths, and 
the I/O operations that are supported. A sound design principle is to first write code in a high-level 
language such as Mitrion-C and, after testing, design and write HDL code. 

Unfortunately, the MVP takes up most of the FPGA resources so that only a fraction of the 
resources are available to the programmer. The base overhead of the MVP copying inputs to outputs 
without any other logic is 11% of the FPGA [60]. Another disadvantage of using Mitrion-C is that 
the MVP runs at a fixed clock frequency that is slower than current FPGAs. 

Mitrion-C replaces conventional arrays in C by so-called "collections." There are three types 
of these data aggregates: Lists, streams and vectors. While lists and streams are most suitable for the 
LISP programming model (LISt Processing), vectors are highly suitable for parallel programming. 
Unlike streams, lists and vectors can be built-up or made smaller using the concatenation (><), take 
(</) and drop (/<) operators. The operands must be one-dimensional (undocumented), which is 
not checked by the compiler. 

As technology changes rapidly, it is not feasible to fully support every chip unless there is 
strong market demand. So, it should not be surprising that not all features are supported. Yet, it is 
amazing that so many different software tools, which are needed to program complex systems, work 
together as well as they do, at least in the case of the Cray system and the software developed by 
Mitrionics and Xilinx. 

Typically, synthesis is possible for Mitrion-C source code that compiles, provided less than half 
of the potential flip flops are used and memory is sufficiently large. Synthesis is sometimes successful 
using the standard Xilinx tools even when the number of flip flops is near 75%. Code that produces 
correct results in simulation and that synthesizes successfully, based on the Xilinx parcheck utility, 
may either not terminate (i.e., not halt), or produce erroneous results when running on hardware. 

In order to synthesize, it may be necessary to reduce the problem size even when only approx- 
imately half of the potential flip flops are used. Although the number of statements is important, the 
size of words in internal memory (Block RAMs) and especially the size of elements in collections 
will largely limit the size of the problem that can be solved on an FPGA. If an FPGA is used to 
perform non-trivial operations on a large number of long lists, then it is likely the number of bits 
used to store the elements is small. 

Unlike conventional arrays in C, collections are immutable objects that cannot be modified. 
Loops in Mitrion-C provide the primary tool to form new collections from existing ones of the 
same type. Lists and vectors, but not streams, can be redefined in loops. However, redefining requires 
that the size be unchanged, as the length of lists and vectors is fixed. 

Algorithms consisting mainly of loops are easily written in Mitrion-C. The base types for 
collections are Boolean, floating point and integer. If a programmer has additional information that 
allows the bit-width of a type to be reduced, then it is advantageous to specify the bit-width to reduce 
program complexity. In some cases, it is necessary to specify the bit-width, e.g., when shifting left. 

A tuple is a small sequence. Unlike collections, tuples may contain different types and instance 
tokens that are memory states. The base type in memory cannot be a tuple or a collection. Collections 
may contain tuples, provided all of the tuples have the same sequence of types. 

Mitrion-C implicitly executes all instructions in parallel. This programming language lacks 
a sequential programming construct. By default, Handel-C executes instructions sequentially, 
although the "seq" keyword provides clear documentation. Data dependencies impose an order on 
computations. Yet, it seems unsatisfactory to never allow sequential computations unless there is a 
data dependency. 

FPGAs provide a memory hierarchy so that internal memories (Block RAMs) offer faster 
access than external memories. In the Mitrion-C programming environment, all reads and writes 
take the same amount of time without regard to the type of memory accessed. So, it is not possible 
to take advantage of the memory hierarchy. 

Registers cannot be accessed without performing a read or write operation. A variable may be 
assigned only once in a scope since there is no concept of order for statement execution. Consider 
the following sequence of valid statements inside a loop: 


c = 5; b = c*c; 


What is the order of execution? What is the value of b? A programmer may be surprised 
to discover that the answers to the preceding questions depend on whether or not c is a “loop 
dependent” variable. A loop dependent variable is always explicitly typed and initialized outside a 
loop. Only loop dependent variables may be returned from a “data-dependent” loop such as a for or 
while statement. The only variable that may be assigned a new value is loop dependent. A statement 
such as 


z = z + 1; 


is not legal if z is not loop dependent since it is not permissible to assign z more than once in a scope. 
(What is the value of z on the RHS?) If c is not loop dependent, then the preceding sequence of 
statements execute sequentially due to the data dependency and the value of b is 25. However, if c 
is loop dependent and the statements appear in a data-dependent loop, then the statements execute 
in parallel. For instance, the value of b is 9 if c had the value 3 before the iteration of the loop body. 
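
The two outcomes described above can be mimicked in ordinary Python. This is only a sketch of the semantics, not Mitrion-C: the function names are hypothetical, and the second function assumes that, inside a data-dependent loop, every right-hand side reads the value a loop dependent variable held before the current iteration.

```python
def single_assignment_scope():
    """c is not loop dependent: the data dependency orders the statements."""
    c = 5
    b = c * c          # b sees the value just bound to c
    return b


def loop_dependent_iteration(c_before):
    """c is loop dependent: both statements read the state from before
    the iteration, so they can be evaluated in parallel."""
    c_after = 5                  # new value of c, visible in the next iteration
    b = c_before * c_before      # b sees the old value of c
    return c_after, b


print(single_assignment_scope())      # 25
print(loop_dependent_iteration(3))    # (5, 9)
```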

The for and foreach expressions are loops over a collection. Unlike streams, a list or a vector 
defined outside a loop, can be referenced in the loop body. In version 1.3, only constants can be used 
to declare collections inside for and foreach expressions. It might therefore seem that it is not possible to 
have a variable number of iterations. However, functions can be "called" with different constants, as Mitrion-C 
supports polymorphic functions. 

To run blocks of code in parallel, a programmer chooses vectors and uses the foreach keyword 
to unroll the code automatically, i.e., to make copies of the loop body. To implement a loop in a 
pipeline fashion, use for instead of foreach and iterate over vectors. In practice, a programmer 
uses only small vectors due to the limited availability of resources to duplicate the body of the loop. 

A common technique is to reshape a list into a list of vectors. Then iterate over the smaller 
vectors using a foreach inner loop. Using this technique, a programmer reduces the resource 
requirements and still achieves a degree of parallelism. This method is applicable whenever the 
problem can be decomposed in a suitable way. A good example is matrix multiplication. 
An important tool in software development is a simulator with a graphical user interface 
(GUI). This simulator includes a "throughput analysis" tool that is sometimes useful in identifying 
bottlenecks in the program. 
CHAPTER 5 


Case Study: Sorting 


Sorting, which is listed as a potential application for acceleration by FPGA coprocessors on the 
Cray [12], was one of the first, very common, algorithms that we successfully ported to the FPGA. 
In addition, sorting algorithms are often used in image processing and other scientific applications. 
The focus of this chapter is sorting on the Cray XD1, using the Mitrion-C and Xilinx tools. In 
particular, we target a Xilinx Virtex-4 LX FPGA [36]. 

A list of n elements is optimally sorted in O(n log n) time on a sequential computer. 
There are well-known parallel algorithms for sorting in O(n log n) time (see for example, [18]). 
Since an FPGA must initially access every element one at a time to sort a list, the coprocessor 
requires at least Ω(n) time to sort a list of n elements; whence, the maximum speedup in sorting a 
single list of n elements is bounded by O(log n). Thus, it is reasonable to expect a speedup of much 
less than 100-fold, as the number and length of lists that can be sorted on an FPGA are limited. 
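
The bound stated above follows from dividing the two running times. The display below is a sketch of that step; the symbols T_seq and T_FPGA are introduced here only for illustration and do not appear elsewhere in the text.

```latex
\[
  \text{speedup}(n)
  \;=\; \frac{T_{\mathrm{seq}}(n)}{T_{\mathrm{FPGA}}(n)}
  \;\le\; \frac{O(n \log n)}{\Omega(n)}
  \;=\; O(\log n).
\]
```

For a list of n = 128 elements, for instance, log2 n = 7, so the logarithmic factor alone is small; any larger measured speedup has to come from the constant factors hidden in the two running times.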

The speedup is reduced due to the overhead of loading the logic onto the target device, 
transferring data (lists) to the local memory of the FPGA, and writing results to the host memory. 
Unfortunately, the Cray XD1 cannot write to the local memory of the FPGA while the coprocessor 
is running. To reduce the overhead due to waiting for the FPGA to finish its task, the multiprocessor 
(SMP) may operate concurrently performing other tasks instead of remaining idle while the FPGA 
is running. 

As FPGAs operate in a regular fashion, the time an FPGA coprocessor takes to perform its 
task is easily determined. The challenge for the programmer is to keep the multiprocessor as busy as 
possible doing useful work while the coprocessor is running. Using this approach, speedup is possible 
even taking into account the relatively large amount of overhead. 

Presumably, the overhead will not substantially vary across different implementations as load- 
ing the logic, writing by the multiprocessor to the local memory of the FPGA, or by the coprocessor 
to the host memory occurring at constant rates. We shall focus on comparing different implemen- 
tations by measuring the time spent while the FPGA is running and ignoring the overhead. The 
goal is to find out what algorithms and programming techniques yield the best performance based 
on run time on the Cray XD1. To compare different implementations, define the speedup for each 
implementation to be the time spent sorting on an AMD Opteron processor using the C standard 
library function qsort () divided by the time spent sorting on an FPGA coprocessor: 


\[ \text{speedup} = \frac{\text{time running qsort on SMP}}{\text{time running on FPGA}} \]
As sorting can be done in various ways, certain requirements are prescribed so that the 
implementations can be compared fairly. Restrictions are imposed both on the method of sorting and on the type of 
results returned. Without loss of generality, assume the task is to sort a database based on one of the 
fields (keys). To access a particular item in sorted order, say the k-th item, in another field (column), 
the original position of the k-th key must be known. Hence, it is necessary not only to sort the keys 
but also to keep track of the original positions. 

The original location may be kept by attaching the original location (index) as a prefix to each 
key in the unsorted list. Define a new data type, say ELEMENT_TYPE, sufficiently large to hold both 
an index and a key. For example, if the length of a list of 16-bit unsigned integers is 128, then it is 
convenient to declare 


#define ELEMENT_TYPE uint:23 


because seven bits are adequate to hold an index between 0 and 127, and the keys are 16-bit (7+16=23). 
By shifting an index, say index1, 16 bits to the left 


ELEMENT_TYPE prefix1 = (ELEMENT_TYPE)index1 << 16; 


and attaching the "prefix" prefix1 via addition, an "indexed key" 


r1 = prefix1 + key1; 


is obtained. The prefixes (indices) must be ignored whenever comparing keys. Consider two indexed 
keys, say r1 and r2, with type ELEMENT_TYPE. To check if r1's key is greater than r2's key, write 
the expression 


(DATA_TYPE)r1 > (DATA_TYPE)r2 


where the data type for the keys is declared using a preprocessor directive: 


#define DATA_TYPE uint:16 


When the sorting process is complete, only the prefixes are written back to memory, i.e., the keys 
are ignored. It is convenient to convert the prefixes to have the same type as the keys. To remove the 
keys and perform this conversion, the expression 


(DATA_TYPE)(r1 >> 16) 


where r1 has type ELEMENT_TYPE is used. 
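
The indexed-key layout can be checked on a host with a few lines of Python. This is a minimal sketch rather than Mitrion-C; the helper names are hypothetical, and the narrowing cast to the 16-bit key type is modeled here by masking off the low 16 bits.

```python
KEY_BITS = 16                      # width of a key (the 16-bit key type)
INDEX_BITS = 7                     # enough for indices 0..127
KEY_MASK = (1 << KEY_BITS) - 1


def attach_index(index, key):
    """Build an indexed key: the index is the prefix, the key the low 16 bits."""
    prefix = index << KEY_BITS
    return prefix + key


def key_of(r):
    """Model of the narrowing cast: keep only the low 16 bits."""
    return r & KEY_MASK


def index_of(r):
    """Recover the original position by shifting the key away."""
    return r >> KEY_BITS


r1 = attach_index(5, 300)
r2 = attach_index(9, 120)
print(key_of(r1) > key_of(r2))     # compare keys only, ignoring prefixes
print(index_of(r1), index_of(r2))  # original positions
```

The largest indexed key, built from index 127 and key 65535, still fits in 7 + 16 = 23 bits.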

The sorting algorithms studied are stable, i.e., two records maintain their relative order when 
the keys are the same. Although the function qsort () does not actually perform a stable sort, the 
speedup is nonetheless well-defined. 

In each implementation, a coprocessor is required to carry out a stable sort on each list of keys 
and return a corresponding list whose k-th entry yields the original location of the key that belongs 
in the k-th position. So to find the element that belongs in the k-th position for a list, look at the 
index, say i, in the k-th position of the list returned. For instance, given the list <7, 7, 5, 2, 6, 
4, 1, 3>, the task is to determine the corresponding list <6, 3, 7, 5, 2, 4, 0, 1>, which 
lists the original positions of the keys in sorted order (counting from zero). To obtain more reliable 
timing results, the input for each implementation consists of the largest number of lists that fit into 
the FPGA's external memory (QDRs). 
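
A host-side reference for the required result is easy to write and is useful for checking what the coprocessor returns. The sketch below is an assumption about how one might do this in Python (the function name is hypothetical); it relies on the fact that Python's built-in sort is stable.

```python
def sorted_positions(keys):
    """Return the original positions of the keys in stable sorted order."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


print(sorted_positions([7, 7, 5, 2, 6, 4, 1, 3]))   # [6, 3, 7, 5, 2, 4, 0, 1]
```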

The width of the keys is fixed reasonably small to permit fairly long lists to be sorted. More 
specifically, every key is a 16-bit nonnegative integer k with 0 ≤ k ≤ 65534. The number of lists 
that can be sorted by the coprocessor during a single run depends on the length of each list, which is 
varied to compare performance using the same algorithm and design techniques. 

Although the run time on an FPGA processor does not depend on the input, the time spent 
sorting on the host processor using qsort() does depend on the input. Thus, the speedup will vary 
depending on the input. To reduce this variability and produce consistent results, a large number of 
sufficiently varied inputs were used. 

The input for each test consists of “pseudorandom permutations" of the list 


<0, 1, 2, ..., n-1> 


where n denotes the length of the list. A simple and fast way to generate pseudorandom permutations 
is well known. Choose an arbitrary permutation, initially. Successively generate permutations as 
described next. Uniformly randomly pick any position and swap the element in this position with 
the last element. After each swap, the rightmost element that was swapped is in its final position. 
Recursively, uniformly randomly pick a position corresponding to the elements excluding those 
elements in the tail that have already been placed and swap with the rightmost element excluding 
the tail. After n-1 swaps, a permutation is generated. This method is fast and produces permutations 
as if sampling from a uniform distribution [13]. 
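
The swapping procedure just described can be sketched as follows. This is an illustrative Python version, not the code used for the experiments; the function name is hypothetical and Python's own generator stands in for the source of random positions.

```python
import random


def next_permutation(perm, rng):
    """Produce a new permutation from perm using n-1 random swaps."""
    perm = list(perm)
    # 'last' marks the rightmost element not yet in its final position.
    for last in range(len(perm) - 1, 0, -1):
        pick = rng.randrange(last + 1)          # uniform over positions 0..last
        perm[pick], perm[last] = perm[last], perm[pick]
    return perm


rng = random.Random(2008)
p = list(range(8))                  # an arbitrary initial permutation
for _ in range(3):
    p = next_permutation(p, rng)    # successively generate permutations
    print(p)
```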

To uniformly randomly pick positions, we employ a multiplicative congruential generator 
(MCG) compatible with the Cray Mathlib routine RANF [16]: 


\[ x_{n+1} = 44485709377909 \, x_n \bmod 2^{48}. \]


Although this MCG has a long period, it is a decidedly poor choice [6]; nevertheless, it has worked well for 
our purposes. 
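
A generator of this kind takes only a few lines. The sketch below assumes the multiplier 44485709377909 and the modulus 2^48, the parameters commonly cited for RANF; the function names and the way a position is derived from the state (scaling by the high-order bits) are choices made for this illustration, not details taken from the Cray library.

```python
MULTIPLIER = 44485709377909        # assumed RANF multiplier
MODULUS = 1 << 48                  # assumed modulus 2**48


def mcg_stream(seed):
    """Yield successive states x_{n+1} = MULTIPLIER * x_n mod MODULUS."""
    x = seed % MODULUS
    while True:
        x = (MULTIPLIER * x) % MODULUS
        yield x


def pick_position(x, upper):
    """Map a 48-bit state to a position in 0..upper-1 using its high bits."""
    return (x * upper) >> 48


gen = mcg_stream(1)                # an odd seed is the usual choice
for _ in range(3):
    print(pick_position(next(gen), 8))
```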
A common way to improve performance is to execute sorts in parallel. Since the Xilinx Virtex-4 
LX FPGAs have four memory banks, it is possible to iteratively perform the following operations 
in parallel: Read four unsorted lists, sort four lists and write four sorted lists (back to correspond- 
ing memory locations). This strategy yields a fourfold improvement in performance. The external 
memories (QDRs) are dual ported, allowing simultaneous reads and writes. 
Another way to improve performance is to carry out I/O operations while doing useful work. 
Pipelining is a good way to overlap computations with I/O operations. While sorting, it is possible 
to concurrently both write the previously sorted list and read the next unsorted list. 

To some extent, these I/O operations come for free. We describe implementations in which 
the coprocessor reads unsorted lists from the QDRs and writes sorted lists back to these external 
memories at corresponding locations. In a typical application, however, the FPGA would write the 
sorted lists sequentially to the host's memory. Thus, there exists an asymmetry: The coprocessor can 
concurrently read four lists from the QDRs but can only write sequentially to the host’s memory. 

To remedy this imbalance, the coprocessor can pack four 16-bit words into a single 64-bit 
word so that it is possible to effectively write four lists concurrently to the host's memory. This 
packing is conveniently expressed as a function call 


v1 = val_4x16to64(x1,x2,x3,x4); 


where the function val_4x16to64() is defined in Figure 5.1. The host processor should also pack 
four 16-bit words into a single 64-bit word to efficiently pack more data into the FPGA's external 
memory. Then a coprocessor unpacks a 64-bit word into four 16-bit words. This unpacking is 
expediently expressed via a function call 


(x1,x2,x3,x4) = val_64to4x16( v1 ); 


where the function val_64to4x16() is defined in Figure 5.2. 


val_4x16to64(v0, v1, v2, v3) 
{ 
    bits:16[4] v4x16 = [(bits:16)v0, (bits:16)v1, (bits:16)v2, 
                        (bits:16)v3]; 
    bits:64 val64 = v4x16; 
} val64; 


Figure 5.1: Pack four 16-bit words into a 64-bit word. 

Another strategy to improve performance is to carry out sorts not only in parallel (for each 
memory bank) but also concurrently (for the same memory bank). Sorting is a relatively slow process 
relative to the time it takes to read a list. After reading a list and commencing a sorting operation, it is 
possible to read another list and commence another sorting procedure (provided there are sufficient 
resources). 

Each list in a memory bank can be read sequentially to allow concurrent writes and avoid 
memory access conflicts. I/O operations may be performed in a round-robin fashion. Each read 
operation in this round-robin chain is associated with a separate piece of hardware configured to 
carry out a sorting operation. Thus, many sorts are scheduled in a "circular pipeline." A foreach 
“block expression” is suitable to implement such a circular queue. The programming challenge is to 
minimize the "idle" time for most of the configured resources. 
val_64to4x16(v64) 
{ 
    bits:16[4] v4x16 = v64; 
    v0 = (DATA_TYPE)v4x16[0]; 
    v1 = (DATA_TYPE)v4x16[1]; 
    v2 = (DATA_TYPE)v4x16[2]; 
    v3 = (DATA_TYPE)v4x16[3]; 
} (v0, v1, v2, v3); 


Figure 5.2: Unpack a 64-bit word into four 16-bit words. 
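
On the host side, the same packing and unpacking can be sketched in Python. The function names are hypothetical, and the word order inside the 64-bit value (first word in the most significant bits) is an assumption made for this example; the order actually used by the coprocessor must match whatever the Mitrion-C conversion produces.

```python
WORD_MASK = 0xFFFF


def pack_4x16_to_64(v0, v1, v2, v3):
    """Pack four 16-bit words into one 64-bit word (v0 in the top 16 bits)."""
    word = 0
    for v in (v0, v1, v2, v3):
        word = (word << 16) | (v & WORD_MASK)
    return word


def unpack_64_to_4x16(word):
    """Split a 64-bit word back into its four 16-bit words."""
    return tuple((word >> shift) & WORD_MASK for shift in (48, 32, 16, 0))


packed = pack_4x16_to_64(1, 2, 3, 4)
print(hex(packed))                  # 0x1000200030004
print(unpack_64_to_4x16(packed))    # (1, 2, 3, 4)
```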
Consider a naive approach to implement a simple brute force algorithm such as bubblesort. A list is 
sorted after n-1 sequential scans of a list of length n. During each successive scan, adjacent elements 
are compared and swapped if necessary. In this way, elements "bubble" up or down to their final 
positions. The invariant of the algorithm is that after the k-th scan, the k-th largest element is placed 
in its final location. 

A traditional programming implementation consists of nested for loops: An outer loop that 
controls the number of scans (passes) and an inner loop that sequentially performs the swaps for 
each scan. Unrolling the outer loop creates a pipeline. This means a separate piece of hardware is 
configured to perform each scan. The input for each such piece of hardware is the output of another 
piece in the pipeline. This program flow is highly regular, and thus suitable for implementation on 
an FPGA. Note, however, that each successive stage in the pipeline finishes sooner than the one before it, as the 
lists become shorter. 

Writing all of the results after sorting is completed is impractical because a slow process at 
the end would become a bottleneck. Due to the invariant of the algorithm, the k-th stage may write 
the k-th largest element. Although having multiple stages write results may lead to contention, there 
is no contention in this case. 

A Mitrion-C code fragment that performs a single scan appears in Figure 5.3. The input for 
a bubblesort pass is a list (s), a memory reference, an offset and a position. The first for loop extracts 
the head of the list (s </ 1). The second for loop iterates over the list with its first element dropped (s /< 1). The 
loop dependent variables are e and Left. No swapping is ever actually done. Instead, the second 
for loop appends the minimum after every comparison to form a new list (v), which is returned, 
and keeps track of the largest element seen. The largest element (last) is written at the specified 
memory location after the loop terminates because of the data dependency. 

How can an efficient pipeline be created using the function BubblesortPass()? Unfortunately, 
it is not possible to iterate over a collection (using a for loop) because the lists become shorter in 
each successive stage. The pipeline is created manually by a sequence of function calls. Each function 
call creates separate logic that is mapped to the target device. Synchronization is automatic by data 
dependency. 


BubblesortPass( s, Mem0, Start_Offset, Position ) 
{ 
    ELEMENT_TYPE Left = 0; 
    ELEMENT_TYPE e = for ( a in (s </ 1) ) { Left = a; } Left; 
    (v,last) = for ( enext in (s /< 1) ) { 
        (Left, e) = if (e > enext) { } (enext,e) 
                    else { } (e,enext); 
    } (>< Left, e); 
    Lm = _memwrite( Mem0, Start_Offset + Position, last ); 
} (v, Lm); 


Figure 5.3: Function in Mitrion-C performing a single pass of bubblesort. 


Lm1 = foreach( list_index in < 0 .. N-1 > ) 
{ 
    uint:32 Offset = list_index*LIST_SIZE; 

    (v1,Lm1) = BubblesortInitialPass( Mem1, Offset ); 
    (v2,Lm2) = BubblesortPass( v1, Lm1, Offset, LIST_SIZE-2 ); 
    (v3,Lm3) = BubblesortPass( v2, Lm2, Offset, LIST_SIZE-3 ); 
    (v4,Lm4) = BubblesortPass( v3, Lm3, Offset, LIST_SIZE-4 ); 
    (v5,Lm5) = BubblesortPass( v4, Lm4, Offset, LIST_SIZE-5 ); 
    (v6,Lm6) = BubblesortPass( v5, Lm5, Offset, LIST_SIZE-6 ); 
    Lm7 = BubblesortLastPass( v6, Lm6, Offset ); 
} Lm7; 


Figure 5.4: Pipeline in Mitrion-C using function calls. 

A sample Mitrion-C code fragment to create a pipeline is given in Figure 5.4. The foreach 
block iterates over all of the lists in a memory bank (QDR). Each list has eight elements; whence, 
seven scans are needed. The initial scan takes place while reading the initial unsorted list. The final 
scan involves two elements which are written to memory after a single comparison. The observed 
speedup generally increases directly with the length of the list. There are sufficient resources on the 
chip to sort lists with up to twenty-two 64-bit unsigned integers. The maximum speedup is 9.7 [4]. 
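
The structure of the pipeline can be imitated sequentially in Python to see how the stages cooperate. This is only a behavioral sketch under stated assumptions: the function names are hypothetical, the stages run one after another instead of concurrently, and a plain list stands in for the external memory.

```python
def bubblesort_pass(stage_input):
    """One scan: emit the smaller element of each comparison and keep the larger.
    Returns the shortened list and the largest element seen."""
    it = iter(stage_input)
    e = next(it)                        # head of the list
    shorter = []
    for enext in it:
        left, e = (enext, e) if e > enext else (e, enext)
        shorter.append(left)            # nothing is swapped in place
    return shorter, e


def bubblesort_pipeline(values):
    """Chain the passes; the k-th stage writes the k-th largest element."""
    if not values:
        return []
    memory = [None] * len(values)       # stands in for one list's slot in memory
    stage = list(values)
    for position in range(len(values) - 1, 0, -1):
        stage, largest = bubblesort_pass(stage)
        memory[position] = largest      # each stage writes to a distinct position
    memory[0] = stage[0]                # the last remaining element is the minimum
    return memory


print(bubblesort_pipeline([7, 7, 5, 2, 6, 4, 1, 3]))   # [1, 2, 3, 4, 5, 6, 7, 7]
```

Because every stage writes to a different position, the writes never collide, which mirrors the remark that there is no contention in this case.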
Selection sort is bubblesort with much less swapping. Each pass of selection sort scans the list for the 
next largest element and performs a single swap to place that element in its final position. A natural 
extension is to place the next largest and smallest elements in their final positions after each scan; 
whence, half the number of scans are needed. An algorithm for 2-way selection sort is introduced 
in Algorithm 5.1. 


Algorithm 5.1 Algorithm 2-way selection sort (L). 


Input: L a list 
Output: J indices for sorted list 
Requirement: The length of L is even. 


1. Set S and T to be empty lists and M to be the given list L with each element assigned 
an index that is the position of the element in L. 

2. While M is not empty 

a) Set m_L = min(e_f, e_l) and m_R = max(e_f, e_l), where e_f and e_l denote the first and last 
elements in M. 

b) Let e_min and e_max denote the minimum and maximum values in M. In the case of 
repeated values, choose e_min and e_max so that the assigned indices are the smallest 
and largest, and let i_min and i_max denote the positions in M of e_min and e_max, 
respectively. 

c) Append e_min to the end of S and insert e_max at the beginning of T. 

d) Replace the elements in M at positions i_min and i_max with the elements m_L and 
m_R, respectively. 

e) Remove the first and last elements in M. 


3. Return the assigned indices J for the list S || T, where || denotes concatenation. 


Example. Apply 2-way selection sort to the list <0,2,7,4,5,2,1,4>. For illustration, we 
label duplicates using subscripts so that a_k indicates this element is the k-th occurrence of the element 
a in the original sequence. In each row below, the elements of S appear to the left of the first vertical bar, 
the elements of T to the right of the second bar, and the remaining list M between the bars. Note, we do not show 
replacements at the ends (step d) because those elements are removed (step e). 


                |  0   2_1  7   4_1  5   2_2  1   4_2  | 
  0             |  2_1  4_2  4_1  5   2_2  1  |  7 
  0  1          |  4_2  4_1  2_1  2_2  |  5  7 
  0  1  2_1     |  4_1  2_2  |  4_2  5  7 
  0  1  2_1 2_2 |  |  4_1  4_2  5  7 





The first steps of the while loop (steps a and b) permit speculative evaluation. The updates 
(steps c and d) may be carried out in parallel. The removal of elements in the last step (e) does not 
incur any cost since these elements may be ignored. 
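
The algorithm can be traced on a host with the following Python sketch. It is a sequential illustration, not the Mitrion-C implementation: the function name is hypothetical, and each element is modeled as a (key, index) pair so that tuple comparison breaks ties by the assigned index, which is one way to realize the tie rule for repeated values.

```python
def two_way_selection_sort(keys):
    """Return the original positions of the keys in stable sorted order."""
    if len(keys) % 2:
        raise ValueError("the length of the list must be even")
    m = [(k, i) for i, k in enumerate(keys)]    # list M with assigned indices
    s, t = [], []
    while m:
        # The first and last elements, ordered so they can fill the vacated slots.
        m_lo, m_hi = min(m[0], m[-1]), max(m[0], m[-1])
        # Positions of the minimum and maximum of M.
        i_min = min(range(len(m)), key=m.__getitem__)
        i_max = max(range(len(m)), key=m.__getitem__)
        s.append(m[i_min])                      # append the minimum to S
        t.insert(0, m[i_max])                   # prepend the maximum to T
        m[i_min], m[i_max] = m_lo, m_hi         # replacements in M
        m = m[1:-1]                             # remove first and last elements
    return [index for _, index in s + t]


print(two_way_selection_sort([0, 2, 7, 4, 5, 2, 1, 4]))   # [0, 6, 1, 5, 3, 7, 4, 2]
```

In the hardware version the two searches and the two updates are independent of one another, which is what allows them to be evaluated speculatively and in parallel.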
None of the built-in collections seem appropriate to manage the list M. Lists cannot be indexed, 
which is needed to perform the updates in the 2-way selection sort algorithm. To reduce the amount 
of resources required, an unrolled for or while loop is desirable. However, streams cannot be 
redefined in a loop and vectors require too many resources to perform updates using non-constant 
indices. 

The best data structure to manage the list M is the equivalent of an array stored in internal 
memory (not QDR). This array is created by calling the function _memcreate() to obtain an 
instance token, say MemIT, to perform read and write accesses, as in the following statement 


DataRAM MemIT = _memcreate( DataRAM last_IT ); 


where the type declaration for block RAMs is given in the preprocessor directive 


#define DataRAM mem ELEMENT_TYPE[ L ] 


in which L denotes the length of a list. These internal block RAMs permit a couple of accesses during 
each clock cycle. After the last array access, the instance token last_IT, passed as an argument to 
_memcreate(), is assigned the last memory reference (ma), as in the following statement: 


last_IT = ma; 


The loops are not unrolled so that there are sufficient resources to perform multiple sorts both 
concurrently (same bank) and in parallel (four banks). Concurrent sorts may be carried out using 
function calls to create a pipeline as illustrated in Figure 5.5. 

Since five lists cannot be read in parallel from the same external memory bank, a pipeline 
is created to sequentially read five lists. After the first list has been copied to an internal block 
RAM, the next list is read from external memory, and so forth, using a chain of function calls to 
CopyList (). Similarly, a chain of function calls to CopyListBack() sequentially writes five lists to 
external memory. After the first list has been sorted, it is copied to external memory. After copying 
a list to external memory, the next sorted list can be safely copied to the next locations in the same 
memory bank. Each sort procedure begins as soon as data becomes available. 

Implementing the while loop in the algorithm using for loops in Mitrion-C is tricky because 
the collection expression for loops must be a constant expression. A solution is to employ a constant 
stream whose shape matches the control structure of the loop. A code fragment is shown in Figure 5.6. 

The inner for loop successively iterates 6, 4, 2, and 0 times, corresponding to the successive 
lengths of the stream variable Pass. The dummy variable is not actually accessed in the loop body. The 
value of the constant Z is not used and its bit width is set to one to minimize resource requirements 
using a declaration 


const bits:1 Z = 0; 


Implementing the while loop in the algorithm using a while loop in Mitrion-C is given in 
Figure 5.7. An implementation based on the while statement is about twice as fast even though 
less than half as many sorts are performed on the chip compared to an implementation using only 
for loops [4]. 
(m11,m12,m13,m14,m15,    // similar instance tokens omitted for second, 
                         // third and fourth banks 
 l11,l12,l13,l14,l15,    // similar instance tokens omitted for second, 
                         // third and fourth banks 
 wB1,wB2,wB3,wB4 ) = for( i in (. 0 .. N/M - 1 .) ) { 
    /* function calls for bank 1 only */ 
    /* Copy external memory to block RAM */ 
    (rd11,smb_11) = CopyList( rdB1, smb_11, iML ); 
    (smb_11,slb_11) = SelectionSort( smb_11, slb_11 );   /* Perform sort */ 
    /* Copy block RAM to external memory */ 
    (slb_11,wr11) = CopyListBack( slb_11, wrB1, iML ); 
    (rd12,smb_12) = CopyList( rd11, smb_12, iML1 );      /* Read next list */ 
    /* Concurrently perform another sort */ 
    (smb_12,slb_12) = SelectionSort( smb_12, slb_12 ); 
    (slb_12,wr12) = CopyListBack( slb_12, wr11, iML1 ); 
    (rd13,smb_13) = CopyList( rd12, smb_13, iML2 ); 
    (smb_13,slb_13) = SelectionSort( smb_13, slb_13 ); 
    (slb_13,wr13) = CopyListBack( slb_13, wr12, iML2 ); 
    (rd14,smb_14) = CopyList( rd13, smb_14, iML3 ); 
    (smb_14,slb_14) = SelectionSort( smb_14, slb_14 ); 
    (slb_14,wr14) = CopyListBack( slb_14, wr13, iML3 ); 
    (rdB1,smb_15) = CopyList( rd14, smb_15, iML4 ); 
    (smb_15,slb_15) = SelectionSort( smb_15, slb_15 ); 
    (slb_15,wrB1) = CopyListBack( slb_15, wr14, iML4 ); 
    /* similar function calls for bank 2 in parallel omitted; 
       returns instance token wrB2 */ 
    /* similar function calls for bank 3 in parallel omitted; 
       returns instance token wrB3 */ 
    /* similar function calls for bank 4 in parallel omitted; 
       returns instance token wrB4 */ 
    iML=iML+ML; iML1=iML1+ML; iML2=iML2+ML; iML3=iML3+ML; iML4=iML4+ML; 
} (smb_11,smb_12,smb_13,smb_14,smb_15,   // tokens omitted for second, 
                                         // third and fourth banks 
   slb_11,slb_12,slb_13,slb_14,slb_15,   // tokens omitted for second, 
                                         // third and fourth banks 
   wrB1,wrB2,wrB3,wrB4); 


Figure 5.5: Sorting concurrently in a pipeline. 
Two_Way_Selection_Sort( u ) 
{   // Initialization code (omitted) 
    (ma,mb) = for( STREAM_BASETYPE (..) Pass in (. 
                       (.Z,Z,Z,Z,Z,Z.), 
                       (.Z,Z,Z,Z.), 
                       (.Z,Z.), 
                       (..) .) ) 
    {   // Initialization code (omitted) 
        INDEX_TYPE j = Start+1; 
        (i_Min,i_Max,e_Min_i,e_Max_i,Lm) = for( dummy in Pass ) 
        {   // body (omitted) 
            j = j + 1; 
        } (iMin, iMax, eMin_i, eMax_i, memA); 
        // updates (omitted) 
        Start = Start+1; End = End - 1; 
    } (mema, memb); 
    // Termination code (omitted) 
} (ma,mb); 


Figure 5.6: A 2-way selection sort function in Mitrion-C using an inner for loop. 


A parallel version of bubblesort is called an "odd-even sort." Denote a list L with n elements by 


L = <a_0, a_1, ..., a_p, a_{p+1}, ..., a_{n-1}>. 


Any pair of consecutive elements, say (a_p, a_{p+1}), is said to be even or odd if the position p of 
the first element of the pair is even or odd, respectively, e.g., (a_0, a_1) is an even pair whereas (a_1, a_2) 
is an odd pair. 


Algorithm 5.2 Algorithm parallel bubblesort (L). 


Input: L a list with n elements 
Output: J indices for sorted list 
Requirement: The length of L is even. 


1. Sort all even pairs in parallel. 

2. Sort all odd pairs in parallel. 

3. Repeat steps 1-2 sequentially exactly n/2 times. 


Example. Apply parallel bubblesort to the list <0,2,7,4,5,2,1,4>. 


Two_Way_Selection_Sort( u ) 
{   // Initialization code (omitted) 
    INDEX_TYPE Start = 0; 
    INDEX_TYPE End = L - 1; 
    (s,t,ma) = for( INDEX_TYPE Pass in < 1 .. MIDR > ) 
    {   // Initialization code (omitted) 
        INDEX_TYPE j = Pass; 
        (i_Min,i_Max,e_Min_i,e_Max_i,Lm) = 
            while( j < End ) iterations 1 
            { 
                // body (omitted) 
                j = j + 1; 
            } (iMin, iMax, eMin_i, eMax_i, memA); 
        // updates (omitted) 
        Start = Start+1; End = End - 1; 
    } (>< e_Min_i, >< e_Max_i, mema); 
    // Termination code (omitted) 
} (s,t); 


Figure 5.7: A 2-way selection sort function in Mitrion-C using an inner while loop. 
A vector is the only collection that permits parallel access. Unlike the even stages, the odd 
stages do not involve all elements. By partitioning lists into two vectors containing the even and odd 
elements, it is possible to iterate over pairs of vectors in a foreach loop to access all elements in 
parallel. The lists are formed when the unsorted lists are read from external memory and reformatted 
as vectors. The elements that are not used during the odd stages must be dropped and added back. 
Figure 5.8 shows a Mitrion-C code fragment implementing parallel bubblesort for a list with eight 
elements. The maximum speedup sorting lists with up to forty 64-bit unsigned integers is 30 [5]. 
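
The odd-even scheme can be checked sequentially with the Python sketch below. It is an illustration under simple assumptions (the function name is hypothetical, and the comparisons inside a stage are executed one after another, whereas on the FPGA they would run in parallel because the pairs within a stage are disjoint).

```python
def odd_even_sort(values):
    """Sort by alternating even and odd stages, repeated n/2 times."""
    a = list(values)
    n = len(a)
    if n % 2:
        raise ValueError("the length of the list must be even")
    for _ in range(n // 2):
        for first in (0, 1):                    # even stage, then odd stage
            for p in range(first, n - 1, 2):    # disjoint pairs (a[p], a[p+1])
                if a[p] > a[p + 1]:
                    a[p], a[p + 1] = a[p + 1], a[p]
    return a


print(odd_even_sort([0, 2, 7, 4, 5, 2, 1, 4]))   # [0, 1, 2, 2, 4, 4, 5, 7]
```

Note that the odd stage never touches the first and last elements, which is why those elements have to be dropped and added back when the list is held as an even vector and an odd vector.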


(e1,o1,e2,o2,e3,o3,e4,o4) = 
    foreach( ELEMENT_TYPE Index1 in <0,2,4,6> ) { 
        // code omitted: read an even element from each bank, 
        // attach prefixes and label the results ai,ci,ei,gi. 
        // code omitted: read an odd element from each bank, 
        // attach prefixes and label the results bi,di,fi,hi. 
    } (ai,bi,ci,di,ei,fi,gi,hi); 

ELEMENT_TYPE[4] EvenVector = reformat( e, [4] ); 
ELEMENT_TYPE[4] OddVector = reformat( o, [4] ); 
(x, w) = for ( EvenOddPasses in <1..4> ) { 
    /** Pass: even **/ 
    (EvensEP, OddsEP) = 
        foreach ( a,b in EvenVector, OddVector ) { 
            (p,q) = if (a > b) (b,a) else (a,b); 
        } (p,q); 
    /** Pass: odd **/ 
    (OddsOP, EvensOP) = 
        foreach ( a,b in (EvensEP /< 1), (OddsEP </ 3) ) { 
            (p,q) = if (b > a) (a,b) else (b,a); 
        } (p,q); 
    /** Add first and last elements back **/ 
    (EvenVector, OddVector) = 
        foreach ( a,b in ( EvensEP </ 1 ) >< EvensOP, 
                         OddsOP >< ( OddsEP /< 3 ) ) 
        {} (a,b); 
} (EvenVector, OddVector); 


Figure 5.8: Parallel bubblesort code fragment in Mitrion-C. 


5.5 COUNTING SORT 


Counting sort determines the position of an element by counting the number of smaller elements. 
For example, the minimum and maximum appear in the first and last positions, respectively. The 
count is unique whenever all elements are distinct. To force the algorithm to be stable in the case 
of duplicates, the count includes the number of identical elements that appear earlier in the original 
list. 


Algorithm 5.3 Algorithm counting sort (L). 


Input: L a list 
Output: J indices for sorted list 


1. For each position in the list L, count the total number of elements less than the 
element appearing at this position plus the number of identical elements appearing 
before this position in L. 

2. Return the sums. 
Example. Apply counting sort to the list <0,2,9,4,5,2,1,4>. By inspection, the counts 
are 0,2,7,4,6,3,1 and 5. This is done by counting the list elements in ascending value beginning 
at 0, as shown in the following diagram: 


List:   <0 2 9 4 5 2 1 4> 
Counts: <0 2 7 4 6 3 1 5> 
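

The counting step of Algorithm 5.3 can also be sketched sequentially in C. This is only a minimal illustration of the counting rule, not the parallel Mitrion-C implementation; the function names counting_sort_ranks and place_by_rank, and the use of int elements, are assumptions introduced here. 

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: for each position pos in list[0..n-1], store in
 * rank[pos] the number of elements smaller than list[pos] plus the number
 * of identical elements that appear before pos (the rule of Algorithm 5.3). */
void counting_sort_ranks(const int *list, size_t n, size_t *rank)
{
    assert(list != NULL && rank != NULL);
    for (size_t pos = 0; pos < n; pos++) {
        size_t count = 0;
        for (size_t other = 0; other < n; other++) {
            if (list[other] < list[pos] ||
                (list[other] == list[pos] && other < pos)) {
                count++;
            }
        }
        rank[pos] = count;
    }
}

/* Hypothetical helper: place each element at its rank. The ranks form a
 * permutation of 0..n-1, so every output slot is written exactly once. */
void place_by_rank(const int *list, const size_t *rank, size_t n, int *sorted)
{
    for (size_t pos = 0; pos < n; pos++) {
        sorted[rank[pos]] = list[pos];
    }
}
```

Applied to the list <0,2,9,4,5,2,1,4>, the first function produces the counts 0,2,7,4,6,3,1,5 given in the example. Each outer iteration is independent of the others, which is the property a parallel implementation can exploit. 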


CountingSort ( Mem1, Offset ) 
{ 
  list = foreach ( Index in <0 .. L-1> ) { 
    SHORT_TYPE data = memread ( Mem1, Offset+Index ); 
  } data; 
  v = reformat( list, [ L ] ); 
  Last_m = foreach ( e in v by INDEX_TYPE Pos_e ) { 
    INDEX_TYPE sum = 0; 
    Sum = for ( Pos_a, a in <0 .. L-1>, list ) { 
      sum = if (a > e) { 
      } sum else { 
        le = if ( (a==e) && (Pos_a >= Pos_e) ) { 
        } sum else { 
          s = sum + 1; 
        } s; 
      } le; 
    } sum; 
    m = memwrite ( Mem1, Offset + Sum, Pos_e ); 
  } m; 
} Last_m; 


Figure 5.9: Counting sort function in Mitrion-C. 


Figure 5.9 displays a counting sort function in Mitrion-C. Sequential counts are performed in parallel 
in a foreach loop over a vector. The counts are accumulated by iteration over a list in a for loop. 
The maximum speedup for lists with up to 50 64-bit elements is 32 [5]. 
5.6 SIMULATION 


Before building and running code on hardware, a simulation is run to check correctness. If the 
simulation is not correct, fix the errors (possibly redesigning and rewriting part of the code) and 
rerun the simulation. Even if the simulation is correct, the code still might not run as expected on 
hardware due to bugs in the software tools, which should be reported. For example, implementations 
for matrix multiplication ran correctly only in simulation. The bug was reported and fixed in the 
most recent release of the software. 

The simulator is not clock accurate. The number of steps reported in batch and graphical 
user interface (GUI) modes is different because additional optimizations are performed in batch 
mode. A programmer might modify a code so that it runs faster during simulation. However, a 
programmer may observe that such improved code runs slower on hardware. The only valid measure 
of performance is the run time on hardware. 

A free simulator is available. This simulator displays a data dependency graph and optionally 
shows names and types. There is a natural correspondence between the data dependency graph and 
the program constructs. The advantage of using this simulator on a personal computer is that the 
option of visualization (GUI mode) is useful. Using the GUI mode when running the simulator 
remotely on the Cray XD1 is not practical as visualization runs slowly over the network. Using the 
free simulator it is not possible to specify a platform, simulate the FTR region, or handle complex 
types (such as packed data). 

Input from external memory banks for a simulation may be automatically read from specified 
ASCII files, either created beforehand, or seemingly generated randomly on-the-fly. For instance, 
to compile and simulate a source file called source.mitc in GUI mode targeting a Virtex II FPGA, 
enter the following commands: 


mitrion -gui -jvm "-Xms512m -Xmx2400m" -sizes \ 
-platform xd1vp50 -output out \ 
-input b1 b2 b3 b4 -- source.mitc 


where b1, b2, b3 and b4 are input files corresponding to the RAM banks, and out is the prefix 
for the corresponding output files. To run in batch mode, replace the -gui flag by the -batch flag. 
The number of input files must match the number of arguments to function main(). One method 
of creating input files is to develop a simple code to create the input and run a simulation to output 
the data to ASCII files. Stopping and resetting a simulation does not reopen files, which is a bug. 

The Throughput Analysis window in GUI mode that helped to spot bottlenecks in the design 
is no longer available as of release 1.5.1. To print estimates of performance statistics when compiling 
code on the Cray XD1, add the switch -log SUMMARY on the command line. For example, to compile 
the source code source.mitc targeting a Virtex-4 FPGA, use the command: 
mitrion -jvm "-Xms512m -Xmx2400m" -sizes -gen-c-header \ 
-log SUMMARY -configure platform xd1lx160-m2 \ 
source.mitc 


5.7 MEMORY CONVENTIONS AND INTERFACES 


Although there is greater flexibility using the free simulator, there are precise restrictions on the 
number and order of arguments and return types of the function main() when targeting an FPGA. 
There must be four or more arguments and return values, each of which is a 64-bit memory reference. 
The kth memory reference corresponds to the kth external memory bank (QDR), k = 1, 2, 3, 4. An 
optional fifth reference corresponds to the host memory (FTR transfer region). Up to 32 additional 
64-bit memory references correspond to registers, accessible by the host and coprocessor. The 
coprocessor can only read values (input registers) written by the host processor before starting the FPGA. 
The host can only read values (output registers) that the coprocessor writes after it finishes its task. 

A typical main() function declaration is given in Figure 5.10. 


// Type declaration for a register: 
#define Register uint:64 

// Type declaration for FPGA external memory: 
#define ExtRAM mem bits:64[0x80000] 

// Type declaration of HOST memory 
#define HostRAM mem uint:64[0x2000000000] 


(ExtRAM, ExtRAM, ExtRAM, ExtRAM, HostRAM, Register) 
main 
(ExtRAM m_a, ExtRAM m_b, ExtRAM m_c, ExtRAM m_d, HostRAM 
 Mem_Host, Register ftr_byteaddr, Register ftr_offset ) 
{ 
  Register ftr_array = ( ftr_byteaddr >> 3 ) + ftr_offset; 
  /* code omitted; 
     returns 64-bit value in last register, plus last 
     memory references m_B1, m_B2, m_B3, m_B4, and 
     LMem_Host */ 
  Register LReg = last_register; 
} (m_B1, m_B2, m_B3, m_B4, LMem_Host, LReg ); 


Figure 5.10: Typical main() function in Mitrion-C. 


The size of each external (QDR) memory bank is specified as 0x80000 (0x100000 for release 1.4.1 
or earlier) 64-bit words (4 MB). The size of the FTR transfer region must be specified as 
0x2000000000 64-bit words (this value is required by the Mitrion-C compiler). The host processor 
writes the FTR address and offset into the input registers at offsets 0x40 and 0x41. Additional 
input registers are written at offsets 0x42-0x5F. The coprocessor reads the byte address, converts 
it into a quadword address, and adds the base offset. The host processor accesses the output 
registers (such as LReg) at offsets 0x60-0x7F. 
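
The register conventions above can be written down as a small C sketch. The enum names and the helpers reg_byte_offset and ftr_quadword_index are hypothetical, introduced only for illustration; the register indices and the shift by 3 bits come from the text and from Figure 5.10, and the multiplication by the 64-bit register size mirrors how Figure 5.11 forms its offsets. 

```c
#include <assert.h>
#include <stdint.h>

/* Register indices as stated in the text (names are assumptions). */
enum {
    REG_FTR_BYTEADDR = 0x40, /* first input register: FTR byte address */
    REG_FTR_OFFSET   = 0x41, /* second input register: FTR offset      */
    REG_INPUT_FIRST  = 0x42, /* additional input registers 0x42-0x5F   */
    REG_INPUT_LAST   = 0x5F,
    REG_OUTPUT_FIRST = 0x60, /* output registers 0x60-0x7F             */
    REG_OUTPUT_LAST  = 0x7F
};

/* Hypothetical helper: byte offset of a 64-bit register, i.e. the
 * register index multiplied by the register size in bytes. */
uint64_t reg_byte_offset(unsigned reg_index)
{
    assert(reg_index <= REG_OUTPUT_LAST);
    return (uint64_t)reg_index * sizeof(uint64_t);
}

/* Hypothetical helper mirroring the coprocessor-side expression
 * (ftr_byteaddr >> 3) + ftr_offset: dropping 3 bits converts a byte
 * address into a quadword (64-bit word) address. */
uint64_t ftr_quadword_index(uint64_t ftr_byteaddr, uint64_t ftr_offset)
{
    return (ftr_byteaddr >> 3) + ftr_offset;
}
```

Under these assumptions the two ranges 0x40-0x5F and 0x60-0x7F each hold 32 registers, which agrees with the limit of 32 additional registers mentioned earlier. 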

The Cray API can be used to interface with the Mitrion Virtual Processor (MVP). The 
starting offset for the first 4 MB memory bank is 64 MB (0x4000000 bytes). The second, third and 
fourth banks are mapped to contiguous addresses immediately following the end of the first bank. 
Sample code using the Cray API is illustrated in Figure 5.11. 
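
As a rough sketch of the bank layout just described, the byte offset of each bank can be computed as follows. The macro names and the function bank_byte_offset are assumptions made for this illustration; the constants follow the text (a 64 MB starting offset and contiguous banks of 0x80000 64-bit words). Note that Figure 5.11 uses 8 MB for both the mapping size and the spacing, which would correspond to the larger bank size of release 1.4.1 or earlier. 

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_BANK_OFFSET 0x4000000UL         /* 64 MB, from the text        */
#define BANK_WORDS        0x80000UL           /* 64-bit words per bank       */
#define BANK_BYTES        (BANK_WORDS * 8UL)  /* 4 MB per bank               */
#define NUM_BANKS         4u

/* Hypothetical helper: byte offset of external memory bank k (0-based),
 * assuming the banks are mapped contiguously after the first one. */
uint64_t bank_byte_offset(unsigned k)
{
    assert(k < NUM_BANKS);
    return FIRST_BANK_OFFSET + (uint64_t)k * BANK_BYTES;
}
```

With these values the four banks start at 0x4000000, 0x4400000, 0x4800000 and 0x4C00000 bytes. 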

Whenever a status is returned, errors may be handled using a statement such as 


if (status < 0) { /* handle error */ } 


which is not shown in Figure 5.11 to simplify the presentation. The Mitrion host abstraction layer (Mithal) 
API provides another interface to the MVP. The header (mithal_gen.hdr) is generated while 
compiling Mitrion-C code. Although there is no compelling reason to use the Mithal API instead 
of the Cray API, the reference document for the former API is useful because the latter API is not 
fully implemented for the MVP [50]. Compile C code (which uses FPGAs), say source.c, using a 
command line such as the following: 


gcc -o source -O3 -Wall -m64 -D_REENTRANT \ 
-I/share/mitrion/mithal/include -I/usr/local/include \ 
-D_GNU_SOURCE source.o -L/usr/local/lib64 -lufp -lpthread -lrapl 


5.8 GENERATING LOGIC AND RUNNING CODES 


To synthesize, load the appropriate modules (mitrion and xilinx), copy the VHDL file 
(user_app.vhd), generated while compiling Mitrion-C code, to a "user_app" directory, and 
change to the "place and route" directory. A bitstream that can be loaded onto a chip can then 
be created by using a makefile and the commands: 


make clean 
make dist_clean 
make top.bin 


To check a bitstream after synthesis, run: 
parcheck -nogui ./ 


in the scratch directory containing the generated files. Note that a program may run correctly even 
when the parcheck utility reports errors. 

On the Cray XD1, to run an executable on hardware, submit a PBS batch file in the subdirectory 
of scratch containing the executable and bitstream files. To run an executable, fpga_app, on a 
compute node with a Virtex II Pro FPGA, write the script file fpga_app.pbs shown in Figure 5.12, 
and submit the job from the directory containing the executable: 


qsub fpga_app.pbs 




#define _USE_XOPEN2 
#define _XOPEN_SOURCE 600 
#include "ufplib.h" //add headers fcntl.h,argp.h,einlib.h, unistd.h 

#define MEM_SIZE (8 * 1024 * 1024) 
#define MEM_DISTANCE (8 * 1024 * 1024) 
#define MEM_OFFSET (64 * 1024 * 1024) 
#define ARRAY_SIZE 33423360 /* Size of array */ 
#define TYPE_VALUE 0x0UL 
#define TYPE_ADDRESS 0x1UL 

typedef unsigned long u_64; typedef uint16_t Element_Type; 
Element_Type * Bank[ 4 ]; void * ftr_mem; 


int main (int argc, char** argv) { 
  int fpga_id, status, num_bytes, flags=O_RDWR|O_SYNC, 
      mmap_flags=PROT_READ | PROT_WRITE; 
  err_e err; size_t length = MEM_SIZE; off_t Offset = MEM_OFFSET; 
  const char *loadfile = "top.bin.ufp"; 
  fpga_id = fpga_open ("/dev/ufp0", flags, &err); // Open FPGA 
  if (fpga_id < 0) { /* handle error */ } 
  // Load logic into FPGA (can also be loaded in PBS batch job): 
  num_bytes = fpga_load(fpga_id, loadfile, &err); 
  if (num_bytes < 0) { /* handle error */ } 
  // Map QDRs into application address space: 
  for (i=0; i < 4; i++) { Bank[i] = (Element_Type*)fpga_memmap(fpga_id, 
      length, mmap_flags, MAP_SHARED, Offset + i*MEM_DISTANCE, &err); 
    if (Bank[i]==NULL) { /* handle error */ } } 
  // Setup host memory for FPGA use: 
  size_t pagesize = getpagesize(); 
  size_t pages_needed = ( ARRAY_SIZE*sizeof (u_64) - 1)/ pagesize + 1; 
  size_t buf_size = pages_needed * pagesize; 
  int code = posix_memalign ( &ftr_mem, pagesize, buf_size ); 
  if (code != 0) { /* handle error */ } 
  // Register FTR region: 
  fpga_register_ftrmem(fpga_id, ftr_mem, buf_size, &err); 
  if (err != NOERR) { /* handle error */ } 
  status = fpga_reset(fpga_id, &err); // Reset logic on FPGA 
  status = fpga_start(fpga_id, &err); // Release the logic from reset 
  status = fpga_wrt_appif_val ( fpga_id, 0x00000000000000001UL, (0x01* 
      sizeof (u_64)) + 0, TYPE_VALUE, &err); // Stop FPGA 
  // Provide FTR address via first input register: 
  status = fpga_wrt_appif_val ( fpga_id, (u_64)ftr_mem, 
      (0x40* sizeof (u_64)), TYPE_ADDRESS, &err); 
  // Provide FTR offset via second input register: 
  status = fpga_wrt_appif_val ( fpga_id, ftr_offset, 
      (0x41* sizeof (u_64)), TYPE_VALUE, &err); 
  // Clear processor state before running 
  status = fpga_wrt_appif_val ( fpga_id, 0x000000000000000000UL, 
      (0x02* sizeof (u_64)), TYPE_VALUE, &err); 
  status = fpga_wrt_appif_val ( fpga_id, 0x000000000000000000UL, 
      (0x01 * sizeof (u_64)), TYPE_VALUE, &err); // Start the AAP 
  do { status = fpga_rd_appif_val ( fpga_id, (void*)&Value, 
      (0x02*sizeof(u_64)), &err); } // Wait for FPGA 
  return 0; 
} 


Figure 5.11: C code using Cray API. 
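

The host-memory setup in Figure 5.11 rounds the buffer size up to a whole number of pages before calling posix_memalign. That arithmetic can be isolated in a short C sketch; the helper name round_up_to_pages is an assumption introduced here, and the page size is passed in rather than queried so the example stays self-contained. 

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of page_size that can hold
 * nbytes bytes (nbytes > 0), using the same arithmetic as Figure 5.11:
 * pages_needed = (nbytes - 1) / page_size + 1. */
size_t round_up_to_pages(size_t nbytes, size_t page_size)
{
    assert(nbytes > 0 && page_size > 0);
    size_t pages_needed = (nbytes - 1) / page_size + 1;
    return pages_needed * page_size;
}
```

Subtracting one before dividing keeps a size that is already a multiple of the page size from being rounded up by an extra page. 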
#!/bin/ksh 
#PBS -l nodes=1:ppn=4#excl 
#PBS -j oe 
#PBS -l walltime=0:30:00 
#PBS -V 

cd $PBS_O_WORKDIR 

./fpga_app 


Figure 5.12: Sample PBS script. 


If a job is running a long time compared to the time it would take the host processors to accomplish 
the same task without using FPGAs, the job should be killed (qdel -Wforce job_number). In 
such cases, there may be a bug in the compiler. If a job crashes a node, an attempt has been made to 
use memory in some unsupported way. 
Other high-level languages to program FPGAs include Dime-C [50] from Nallatech and 
Impulse-C [49] from Impulse Accelerated Technologies. Among the FPGA languages currently available, 
some are easier to learn than others. Nevertheless, none of them completely removes the various 
hurdles a programmer must confront. 

The newest entry into the reconfigurable computing arena by Convey Computer Corporation 
is the HC-1, containing four Xeon CPUs with access to four Virtex-5 FPGA coprocessors [9]. The 
shared memory model integrates the CPU and the FPGA, eliminating some of the overhead of using 
FPGA technology. Programming the HC-1 involves using precompiled code and sophisticated 
compilers to configure the target device. This means a programmer becomes simply a user of the 
technology and does not write FPGA code and wait hours to synthesize a bit file that would later 
be loaded to configure an FPGA device. Instead, a programmer learns special directives that are 
wrapped around the code segments to be executed on the FPGA. Both FORTRAN and C are 
supported. 

How an FPGA on the HC-1 can be configured depends entirely on the "personality" — 
a set of packages of the precompiled parts. Currently, only a small number of personalities are 
available. Hence, programming is severely limited by the choice of a personality, which may be 
adequate for only some applications. To achieve exceptionally good performance, a personality likely 
provides more specialized as opposed to more general functionality. Casual programmers cannot 
develop a personality. For example, a team of engineers at Convey and scientists at the University 
of California San Diego developed a “Proteomics Search Personality" which yielded a 100-fold 
speedup in performance. The Proteomics Search algorithm conducts an unrestricted, blind search of 
a massive protein database to look for protein structure modifications. Such a search was not possible 
prior to the HC-1 [9]. 

Due to Moore's Law for FPGAs [46], the resources and capabilities of FPGAs 
continue to increase steadily. Nevertheless, HPC still presents a significant design challenge for 
FPGAs. Larger functional blocks are required to handle the greater computational density found 
in supercomputer applications. Larger memory units are required to handle double precision values 
ubiquitous in HPC calculations. Faster clock speeds are required to keep pace with the CPU speeds. 
Finally, a faster interface between the CPU and FPGA is required: tighter connectivity with the 
CPU, faster reconfiguration times and more I/O ports for streaming data to and from the host CPU. 

More generally, an FPGA is just one type of hardware accelerator currently available. PACT 
XPP technology is based on an array of units, which may be regarded as specialized FPGAs. Multi- 
cores, GPUs, Cell processors, array processors and other hardware accelerators are being offered to 
the HPC community in an effort to provide faster computational platforms. The motivation for this 
diverse spectrum of accelerators is twofold: increased computational speed and decreased power 
consumption [51]. Yet, the costs for the hardware components and the software licenses cannot be 
ignored [60]. 

Multicore systems lie at one end of the spectrum of hardware accelerators. These systems 
not only provide increased speed but also reduced power consumption per core since less circuitry 
is needed than in a traditional multi-CPU single-core platform. The disadvantage is that the cores 
located on the same node of a multicore system compete for the limited node memory bandwidth, 
which can limit speedups. On the other hand, there are no software hurdles; the programming model 
is identical to that of the single core multi-CPU system. 

At the other end of the spectrum are FPGAs. FPGAs provide a significant reduction in 
power consumption, which is just a fraction of that of a traditional CPU [34]. With the potential massive 
parallelism provided by FPGAs, speedups can be on the order of what is equivalent to 100s or even 
1000s of cores. The limitation here is the programming effort. Not only is the programming effort 
a major obstacle, but licenses for these high-level languages can be on the order of thousands of 
dollars for a single seat. 

One may consider graphics processing units (GPUs) to lie in the middle of the spectrum 
between the two extremes of multicore systems and FPGAs [61]. GPUs are currently the most 
popular hardware accelerator. Although the total power consumption is typically double that of a 
traditional CPU (power savings is evident in terms of power consumption per GFLOPS) and the 
programming effort substantial, the cost is much reduced since it is based on commodity graphics 
cards and, hence, the concomitant economies of scale [34]. 

Nvidia has recently come out with its Tesla system [40]. It is the first graphics system to 
target the HPC community. The system combines two quad-core Xeons with four GPUs consisting 
of 240 cores. Programming is supported with CUDA, a language like C, with massively parallel 
constructs. 

ATI, now part of AMD, has come out with its FireStream product and Brook+ programming 
language while Intel is coming out with its own GPU, code-named Larrabee, that will support the 
x86 instruction set [1, 37]. All of these systems have native support for IEEE double precision 
arithmetic reflecting a strong focus on scientific programming. On the other hand, these GPU 
programming languages are based on the stream-processing model. This model is most efficient 
when computations can be cast within a SIMD algorithm. In this sense, FPGAs are more flexible 
computational engines. 
Even without hardware accelerators, the computational landscape is changing. Today's top 
HPC performing systems are on the order of hundreds of thousands of cores. Such systems are not 
only expensive to power but also to program. MPI codes that run without error on 128 cores often 
hang or crash on a multi-thousand core system [52]. Newer programming languages, like Titanium, 
Co-Array FORTRAN, and Unified Parallel C, are examples of PGAS (Partitioned Global Address 
Space) languages that hope to ease the cost of program development and take advantage of the latest 
hardware [38]. Also, MPI libraries, such as PETSc, are being employed to obviate explicit MPI 
programming [39]. 
FPGA acceleration is a new technology that promises dramatic increases in computational performance 
for some applications. Because of their unique technological implementation, these devices 
aptly combine software and hardware into a reconfigurable architecture. The strongest advantage of 
using FPGAs is their ability to exploit data parallelism, instruction concurrency, and pipelining. 

Today, reconfigurable supercomputing is a reality, and major manufacturers of high perfor- 
mance computers are currently building machines with advanced FPGA acceleration units. Further- 
more, some preliminary experiments have shown that reconfigurable supercomputing offers increased 
performance of several orders of magnitude for selected examples. Unfortunately, to date, there is 
no easy way to accelerate a traditional parallel code using FPGAs. There is no magical compiler 
flag that generates ready to use reconfigurable code. Instead, programmers have the entire burden of 
deciding what tasks to assign to the coprocessor and when to start and stop the device, in addition to 
learning new programming tools and new programming paradigms. Ultimately, users must redesign 
and rewrite major portions of code. As such, reconfigurable supercomputing is a complex and 
time-consuming endeavor. 

To advance the use of FPGAs, it is not enough to increase the performance and capabilities of 
the commodity components; integrated systems are needed. For example, more I/O ports are needed 
on FPGA devices to advance parallel computations, especially systolic computations, to more fully 
utilize the device capabilities. Shared memory is an important integration to eliminate the overhead 
of communication costs. In particular, a system of a large array of FPGAs with relatively few CPUs 
is promising to advance fine-grain parallelism. 

Sophisticated compilers are also needed to shift the burden of scheduling tasks to run on 
the coprocessor. Using profiling and tracing, a programmer decides what code runs on which (co- 
)processor, writes all of the code, and the compiler performs scheduling and reconfiguration of the 
device as prescribed by the program. Shifting and rearranging the code to meet timing requirements 
is a task better suited to a compiler. 

Today, the HPC industry seems to be at a crossroads where long-held traditions of both 
hardware and software are being challenged. It is highly likely that the changes observed in 
supercomputing over the past couple of decades will pale compared to the changes on the horizon in the 
next couple of decades. 
APPENDIX A 


search.c 


#include "einlib.h" 
#include <stdio.h> 
#include <stdlib.h> 
#include <fcntl.h> 
#include <argp.h> 
#include <string.h> 
#include <unistd.h> 
#include <time.h> 
#include <math.h> 


/* Configuration Parameters */ 

/* Note, the probability is per dimension. 
   So, for a 4-D hypercube, the probability of point-hypercube 
   intersection is PROB^4. 0.841^4 = 50% */ 

#define PROB .841 
#define NBR_DIMS 4 
#define NBR_HYPERCUBES 64 
//#define NBR_POINTS 1000000 
#define NBR_POINTS 256 * 1024 
#define VERBOSE_DEPTH 10 
#define REPEAT_COUNT 500 


/* Derived Configuration Parameters */ 
#define RESULT_FIELDS (((NBR_HYPERCUBES-1)/64)+1) 


/* Define the FPGA write types (address or data value) */ 
#define TYPE_VAL  0x0UL 
#define TYPE_ADDR 0x1UL 


/* Define the addresses of the FPGA Registers */ 
#define MEGA 1024 * 1024 
#define XY_REG ( 0 * MEGA) 
#define RESULT_REG ( 0 * MEGA) 
#define RECT_CFG_REG (64 * MEGA) 
#define RESULT_PTR_REG (64 * MEGA + 0x400UL) 


/* C++ wannabe */ 
typedef char bool; 
#define true 1 
#define false 0 


/* Declare types for unsigned integers. */ 
typedef unsigned long u_64; 
typedef unsigned short u_16; 


typedef struct { 
  u_16 d[NBR_DIMS]; 
} point; 


typedef struct { 
  u_16 min; 
  u_16 max; 
} dim; 


typedef struct { 
  dim d[NBR_DIMS]; 
} hypercube; 


typedef struct { 
  u_64 f[RESULT_FIELDS]; 
} result; 
/**********************************************************************/ 
/* Command line parsing */ 
/**********************************************************************/ 
static struct argp_option options[] = { 
  {"verbose", 'v', 0, 0, "Produce verbose output"}, 
  {0} 
}; 


static error_t parse_opt (int key, char *arg, struct argp_state *state); 


struct arguments 
{ 
  int verbose;  /* The -v flag */ 
}; 


static char args_doc[] = ""; 
static char doc[] = "Demonstration of a Celoxica design."; 
static struct argp argp = {options, parse_opt, args_doc, doc}; 


/* Provide a function to parse the parameters. */ 
static error_t parse_opt (int key, char *arg, struct argp_state *state) 
{ 
  struct arguments *arguments = state->input; 

  switch (key) { 
  case 'v': 
    arguments->verbose = 1; 
    break; 

  case ARGP_KEY_ARG: // no arguments accepted 
    if (state->arg_num >= 0) { 
      argp_usage(state); 
    } 
    break; 

  case ARGP_KEY_END: 
    break; 

  default: 
    return ARGP_ERR_UNKNOWN; 
  } 

  return 0; 
} 

int print_err (err_e e) 
{ 
  switch (e) { 
  case NOERR: 
    printf ("Success.\n"); 
    break; 

  case FILEOPRERR: 
    printf("File operation system call failed.\n"); 
    break; 

  case INVALIDOP: 
    printf("Invalid API operation requested.\n"); 
    break; 

  case INVALIDVAL: 
    printf("Invalid value passed to the API call.\n"); 
    break; 

  case INVALIDARGS: 
    printf("Invalid argument passed to the API call.\n"); 
    break; 

  case INVALIDINP: 
    printf("Invalid input given to the API call.\n"); 
    break; 

  case DEVOPRERR: 
    printf ("FPGA device operation error.\n"); 
    break; 

  case UNKNOWNERR: 
    printf ("Unknown error.\n"); 
    break; 

  default: 
    break; 
  } 

  return 0; 
} 

/**********************************************************************/ 
/* init_data_structures */ 
/* populate the input data structures with points & hypercubes */ 
/**********************************************************************/ 
int init_data_structures(struct arguments *arguments, point *points, 
                         hypercube *hypercubes) { 
  u_16 d1, d2, width, start_max; 
  int i, j; 

  for( i = 0; i < NBR_POINTS; i++ ) { 
    for( j = 0; j < NBR_DIMS; j++ ) { 
      d1 = (u_16)random(); 
      points[i].d[j] = d1; 
    } 
  } 

  width = (-1); 
  width = width * PROB; 
  start_max = (-1 - width); 
  if (arguments->verbose) { 
    printf(" start_max=%08x  width=%08x\n", start_max, width); 
  } 
  for( i = 0; i < NBR_HYPERCUBES; i++ ) { 
    for( j = 0; j < NBR_DIMS; j++ ) { 
      d1 = (u_16)( random() % (start_max+1)); 
      d2 = d1 + width; 
      hypercubes[i].d[j].min = d1; 
      hypercubes[i].d[j].max = d2; 
    } 
  } 

  if (arguments->verbose) { 
    printf (" Displaying first %i points and hypercubes:\n", 
            VERBOSE_DEPTH); 
    point * p = points; 
    for( i = 0; i < VERBOSE_DEPTH; i++ ) { 
      printf( " point[%i]\n", i ); 
      for( j = 0; j < NBR_DIMS; j++ ) { 
        printf( " dim[%i] %04X\n", j, (*p).d[j] ); 
      } 
      p++; 
    } 

    hypercube *s = hypercubes; 
    for( i = 0; i < VERBOSE_DEPTH; i++ ) { 
      printf( " hypercube[%i]\n", i ); 
      for( j = 0; j < NBR_DIMS; j++ ) { 
        printf (" dim[%i] (%04X:%04X)\n", 
                j, (*s).d[j].min, (*s).d[j].max ); 
      } 
      s++; 
    } 
  } 
  return(0); 
} 


int 


search_hw(struct arguments *arguments, int fpga_id, point *points, 
          hypercube *hypercubes, result **resultsPtr) { 
  err_e e; 
  int i, j; 

  /* Allocate memory where FPGA can store results */ 
  result *results_temp; 
  results_temp = fpga_set_ftrmem(fpga_id, 9, &e); 
  *resultsPtr = results_temp; 

  /* Declare a pointer to the "hypercube" memory in FPGA */ 
  int hc_mem_size = NBR_HYPERCUBES * sizeof(hypercube); 
  u_64 *hcdst = fpga_memmap(fpga_id, hc_mem_size, PROT_READ | 
                            PROT_WRITE, MAP_SHARED, RECT_CFG_REG, &e); 

  /* Declare a pointer to the "point" memory in FPGA */ 
  int pnt_mem_size = 64 * 1024 * 1024; 
  u_64 *pointdst = fpga_memmap(fpga_id, pnt_mem_size, PROT_READ | 
                               PROT_WRITE, MAP_SHARED, XY_REG, &e); 

  /* Load the hypercubes to be searched into the FPGA */ 
  printf(" Loading hypercubes into FPGA\n"); 
  memcpy(hcdst, (u_64 *)hypercubes, hc_mem_size); 

  /* Optional test to make sure that the hypercubes are loaded */ 
  if (arguments->verbose) { 
    u_64 read_data, expected; 
    fpga_rd_appif_val (fpga_id, &read_data, RECT_CFG_REG, &e); 
    expected = *((u_64*)(hypercubes)); 
    if ( read_data != expected ) { 
      printf (" FAILED TO LOAD FPGA. Expected: %016lX Received: " 
              "%016lX\n", expected, read_data); 
      return(1); 
    } else { 
      printf(" Hypercubes loaded successfully\n"); 
    } 
  } 

  /* Search for intersects - send each point to the FPGA (the FPGA will 
     write results back into processor memory) */ 
  printf (" Searching\n"); 
  for( i = 0; i < REPEAT_COUNT; i++ ) { 
    // tell the FPGA where to store the results 
    fpga_wrt_appif_val(fpga_id, (long)results_temp, RESULT_PTR_REG, 
                       TYPE_ADDR, &e); 
    memcpy(pointdst, (u_64*)points, NBR_POINTS * sizeof(point)); 
  } 

  // this will force the FPGA to flush any remaining results 
  fpga_wrt_appif_val(fpga_id, 0x0, RESULT_PTR_REG, TYPE_VAL, &e); 
  printf(" Finished\n"); 

  if (arguments->verbose) { 
    printf (" Displaying first %i results:\n", VERBOSE_DEPTH); 
    result * r = results_temp; 
    for( i = 0; i < VERBOSE_DEPTH; i++ ) { 
      printf( " point[%i]\n", i ); 
      for( j = 0; j < RESULT_FIELDS; j++ ) { 
        printf( " result[%i] %016lX\n", j, (*r).f[j] ); 
      } 
      r++; 
    } 
  } 
  return(0); 
} 


int 


search_sw(struct arguments *arguments, point *points, 
          hypercube *hypercubes, result **resultsPtr) { 
  result *results; 
  results = malloc( NBR_POINTS * sizeof(result)); 
  *resultsPtr = results; 

  int rc, i, j, k; 
  u_16 pnt, min, max; 
  bool hit; 
  point *p; 
  result *r; 
  hypercube *s; 

  printf(" Searching\n"); 
  for( rc = 0; rc < REPEAT_COUNT; rc++ ) { 
    p = points; 
    r = results; 
    for( i = 0; i < NBR_POINTS; i++, p++, r++ ) { 
      s = hypercubes; 
      (*(u_64*)(r)) = 0; // clear memory 
      for( j = 0; j < NBR_HYPERCUBES; j++, s++ ) { 
        hit = true; 
        for( k = 0; k < NBR_DIMS; k++ ) { 
          pnt = (*p).d[k]; 
          min = (*s).d[k].min; 
          max = (*s).d[k].max; 
          if( (pnt < min ) || (max < pnt) ) { 
            //printf( "><> Point %i is outside of hypercube %i on " 
            //        "dimension %i\n", i, j, k ); 
            hit = false; 
            break; // don't bother checking the remaining dimensions 
                   // of this hypercube 
          } 
        } 
        // update results 
        if( hit ) { 
          //printf( "><> Yippie! Point %i intersects hypercube" 
          //        " %i\n", i, j ); 
          (*r).f[j/64] += ((u_64)1) << (j % 64); 
        } 
      } 
    } 
  } 
  printf(" Finished\n"); 

  if (arguments->verbose) { 
    printf (" Displaying first %i results:\n", VERBOSE_DEPTH); 
    result * r = results; 
    for( i = 0; i < VERBOSE_DEPTH; i++ ) { 
      printf( " point[%i]\n", i ); 
      for( j = 0; j < RESULT_FIELDS; j++ ) { 
        printf( " result[%i] %016lX\n", j, (*r).f[j] ); 
      } 
      r++; 
    } 
  } 
  return(0); 
} 


int validate(result *resultsSw, result *resultsHw) { 
  result *rs = resultsSw; 
  result *rh = resultsHw; 
  u_64 rfs, rfh; 
  int i, j; 
  for( i = 0; i < NBR_POINTS; i++, rs++, rh++ ) { 
    for( j = 0; j < RESULT_FIELDS; j++ ) { 
      rfs = (*rs).f[j]; 
      rfh = (*rh).f[j]; 
      if (rfs != rfh ) { 
        printf (" FAILURE: Mismatch between hardware & software " 
                "results for point %i\n", i); 
        printf(" s/w: %016lX h/w: %016lX\n", rfs, rfh); 
        return(1); 
      } 
    } 
  } 
  printf(" Hardware and software results match.\n"); 
  return(0); 
} 
/**********************************************************************/ 
/* On with the main program ... */ 
/**********************************************************************/ 
int main (int argc, char **argv) { 
  struct arguments arguments; 
  int fpga_id; 
  err_e e; 

  /* Parse any command line options and arguments. Store them in */ 
  /* the arguments structure. */ 
  arguments.verbose = 0; 
  argp_parse (&argp, argc, argv, 0, 0, &arguments); 

  /* Open the FPGA device */ 
  fpga_id = fpga_open ("/dev/ufp0", O_RDWR|O_SYNC, &e); 
  if (e != NOERR) { 
    printf ("Failed to open FPGA device. Exiting.\n"); 
    return(1); 
  } 

  /* Declare & initialize structures which hold input data */ 
  point *points = malloc( NBR_POINTS * sizeof(point)); 
  hypercube *hypercubes = malloc( NBR_HYPERCUBES * sizeof(hypercube)); 
  printf ("\nGenerating sample data for points & hypercubes\n"); 
  init_data_structures(&arguments, points, hypercubes ); 

  /* Declare pointers to result data */ 
  result * resultsSw; 
  result * resultsHw; 

  /* Perform search in hardware */ 
  clock_t t1 = clock(); 
  printf ("\nSearching for point-hypercube intersections " 
          "using hardware\n"); 
  search_hw(&arguments, fpga_id, points, hypercubes, &resultsHw); 
  clock_t t2 = clock(); 

  /* Perform search in software */ 
  clock_t t3 = clock(); 
  printf ("\nSearching for point-hypercube intersections " 
          "using software\n"); 
  search_sw(&arguments, points, hypercubes, &resultsSw); 
  clock_t t4 = clock(); 

  /* Make sure the hardware and software results match */ 
  printf ("\nValidating results\n"); 
  validate(resultsSw, resultsHw); 

  /* Analyze results */ 
  printf ("\nStatistics:\n"); 
  double time_hw = (double) ( (t2-t1)/(CLOCKS_PER_SEC/1000) ); 
  double time_sw = (double) ( (t4-t3)/(CLOCKS_PER_SEC/1000) ); 

  printf (" Hardware time: %8d ms\n", (int)time_hw); 
  printf (" Software time: %8d ms\n", (int)time_sw); 
  printf (" Factor: %4.2f\n", time_sw / time_hw); 

  /* Close the FPGA device */ 
  fpga_close (fpga_id, &e); 
  // TODO cleanup memory 

  return 0; 
} 

APPENDIX B 


search.hcc 


/********************************************************************** 
 * Copyright (C) 2005 Celoxica Ltd. All rights reserved.              * 
 ********************************************************************** 
 *                                                                    * 
 * Project     : NRL Eval                                             * 
 * Date        : 01 Feb 2006                                          * 
 * File        : main.hcc                                             * 
 * Author      : David Gardner                                        * 
 *                                                                    * 
 * Description:                                                       * 
 *   Point-Hypercube intersection search                              * 
 *                                                                    * 
 * Date         Version  Author  Reason for change                    * 
 *                                                                    * 
 * 16 Feb 2006  0.1      DG      Original version                     * 
 *                                                                    * 
 **********************************************************************/ 


#include <cray_xd1.hch> 


// Clock Configuration - Allowable frequencies are from 63MHz to 199MHz 
#define CLOCK_RATE 140000000 


// Useful constants
#define NBR_HYPERCUBES 64
#define NBR_DIMS 4
#define BYTES_PER_DIM 4
#define BUS_WIDTH 8
#define HC_MEM_SIZE (NBR_HYPERCUBES * NBR_DIMS * BYTES_PER_DIM / BUS_WIDTH)
#define MEGA (1024 * 1024)

// Configure FPGA Address space - 128MB
//
// Notes regarding address selection
//
// 1. It's probably best to cover the entire 128MB address space.
//    Otherwise, an errant read to an unused memory area could cause
//    a kernel panic.
//
// 2. Addresses are given as "byte-address". The PSL will drop 3 bits
//    as appropriate to translate byte addresses to quadword addresses.
//
// 3. BASE and TOP addresses are inclusive. Hence, the "-1" on TOP
//    addresses.
//
// 4. The address spaces for all callbacks must not overlap.
//    In particular, if a read is performed in an address space that
//    is overlapping, the results will be unpredictable (possibly
//    kernel panic).
//
// 5. The low 64MB of address space is "Memory Space" and is subject
//    to "write combining". Address spaces that may be adversely
//    affected by write combining should be at or above the 64MB
//    boundary.
//    For more information refer to the "Processor requests" section
//    of the Cray XD1 FPGA Development document.
//
#define TOP ( 128 * MEGA )
#define CONFIG_TOP ( TOP - 1 )
#define CONFIG_BASE ( 64 * MEGA ) // 0x4000000
#define POINTS_TOP ( CONFIG_BASE - 1 )
#define POINTS_BASE ( 0 * MEGA ) // 0x0000000


// Boilerplate code to setup the FPGA clock and reset signals 
#ifdef SIMULATE
set clock = external;
#else
interface port_in (unsigned 1 user_clk) user_clk_if ();
interface port_in (unsigned 1 reset_n) reset_n_if ();
set clock = internal user_clk_if.user_clk with {
    rate = (CLOCK_RATE / 1000000) };
set reset = internal !reset_n_if.reset_n; 
#endif 


// Synchronization flags 
static signal unsigned 1 newPointNextCycle = 0; 
static unsigned 1 resultRdy = 0; 


/**********************************************************************
 * Configuration Registers
 *
 * Addr   Function
 * ----   --------
 * 0-63   Hypercube Buffers
 *        This is where we manage the list of hypercubes that we're
 *        searching through. (This is actually a FIFO, not memory
 *        mapped.)
 * 64     Write: The address to processor memory where results are
 *        to be stored
 *
 *********************************************************************/
// pointer to processor memory where we'll store our results
unsigned 40 resultPtr = 0;
unsigned 64 hypercubes[HC_MEM_SIZE];


macro proc Shift(arry, newVal, DEPTH) {
    par {
        arry[DEPTH-1] = newVal;
        par ( i = 0; i < DEPTH-1; i++ ) {
            arry[i] = arry[i+1];
        }
    }
}

macro proc WriteConfigData(addr, data, byteMask) {
    if ( addr[7] == 0 ) par {
        // The processor passes us a 64 bit value which contains
        // hypercube information.
        // As we receive each new datum, we'll shift the previous
        // data through the array
        Shift(hypercubes, data, HC_MEM_SIZE);
    }
    else {
        resultPtr = data <- 40;
        RTWriteSetAddr(resultPtr); // this will force a flush if needed
    }
}

// Just for debug purposes, let's support reading from the end of our
// hypercube shift register
macro proc ReadConfigData(addr, data) {
    data = hypercubes[0];
}

/**********************************************************************
 * Point input macros
 *
 *********************************************************************/

unsigned 16 x, y, z, a;
unsigned 64 searchResult;

macro proc WritePointData(addr, data, byteMask) {
    par {
        a = data[63:48];
        z = data[47:32];
        y = data[31:16];
        x = data[15:0];
        newPointNextCycle = 1;
    }
}

macro proc ReadSearchResult(addr, data) {
    data = searchResult;
}

/**********************************************************************
 * Search macro
 *
 * This is where the real work gets done. As new points are
 * written, they are compared against the hypercubes which were
 * previously loaded into the FPGA.
 *********************************************************************/


macro proc RunSearch() {

    // TODO This macro assumes 64 hypercubes with 4 dimensions each...
    // Perhaps it should be rewritten to automatically configure at
    // compile time.
    unsigned 1 dimTest[NBR_HYPERCUBES][NBR_DIMS*2];
    unsigned 1 result[NBR_HYPERCUBES];
    static unsigned 1 rdyFlag[3] = {0,0,0};

    while(1) par {

        // --- Pipeline stage 0 ---
        // Test min & max conditions of each dimension of each
        // hypercube simultaneously
        rdyFlag[0] = newPointNextCycle;
        rdyFlag[1] = rdyFlag[0];
        par( i = 0; i < NBR_HYPERCUBES; i++ ) {
            dimTest[i][0] = hypercubes[i*2+0][15: 0] <= x;
            dimTest[i][1] = hypercubes[i*2+0][31:16] >= x;
            dimTest[i][2] = hypercubes[i*2+0][47:32] <= y;
            dimTest[i][3] = hypercubes[i*2+0][63:48] >= y;
            dimTest[i][4] = hypercubes[i*2+1][15: 0] <= z;
            dimTest[i][5] = hypercubes[i*2+1][31:16] >= z;
            dimTest[i][6] = hypercubes[i*2+1][47:32] <= a;
            dimTest[i][7] = hypercubes[i*2+1][63:48] >= a;
        }

        // --- Pipeline stage 1 ---
        // Consolidate the dimTest flags into one result bit per hypercube
        rdyFlag[2] = rdyFlag[1];
        par( i = 0; i < NBR_HYPERCUBES; i++ ) {
            result[i] = dimTest[i][0] &&
                        dimTest[i][1] &&
                        dimTest[i][2] &&
                        dimTest[i][3] &&
                        dimTest[i][4] &&
                        dimTest[i][5] &&
                        dimTest[i][6] &&
                        dimTest[i][7];
        }

        // --- Pipeline stage 2 ---
        // Reformat the results into a quadword and signal the
        // "RunResultStream" to send that quadword to the processor
        resultRdy = rdyFlag[2]; // signal the downstream logic

        // TODO: make this a macro and give my fingers a rest.
        searchResult =
            result[63] @ result[62] @ result[61] @ result[60] @
            result[59] @ result[58] @ result[57] @ result[56] @
            result[55] @ result[54] @ result[53] @ result[52] @
            result[51] @ result[50] @ result[49] @ result[48] @
            result[47] @ result[46] @ result[45] @ result[44] @
            result[43] @ result[42] @ result[41] @ result[40] @
            result[39] @ result[38] @ result[37] @ result[36] @
            result[35] @ result[34] @ result[33] @ result[32] @
            result[31] @ result[30] @ result[29] @ result[28] @
            result[27] @ result[26] @ result[25] @ result[24] @
            result[23] @ result[22] @ result[21] @ result[20] @
            result[19] @ result[18] @ result[17] @ result[16] @
            result[15] @ result[14] @ result[13] @ result[12] @
            result[11] @ result[10] @ result[ 9] @ result[ 8] @
            result[ 7] @ result[ 6] @ result[ 5] @ result[ 4] @
            result[ 3] @ result[ 2] @ result[ 1] @ result[ 0];
    }
}

macro proc RunResultStream() {
    while(1) par {
        if ( resultRdy ) par {
            // send the result to the processor
            RTWriteSequential( searchResult );
        } else delay;
    }
}

/**********************************************************************
 * main
 *********************************************************************/

void main() {
    par {
        // Run the Rapid Transport "driver"
        RunRTIf();

        // Run the application specific logic
        RunRTClient( POINTS_BASE, POINTS_TOP, ReadSearchResult,
                     WritePointData );
        RunRTClient( CONFIG_BASE, CONFIG_TOP, ReadConfigData,
                     WriteConfigData );
        RunSearch();
        RunResultStream();
    }
}
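
To make the data layout used by RunSearch easier to follow, the three pipeline stages can be modelled sequentially in plain C. This is only a sketch under stated assumptions: the names field16, pack4 and search_model are hypothetical, and this is not the search_sw routine called from the host program. The bit positions follow the Handel-C listing: each hypercube occupies two quadwords holding (min, max) pairs for x, y, z and a, and the concatenation result[63] @ ... @ result[0] places result[0] in the least significant bit.

```c
#include <stdint.h>

#define NBR_HYPERCUBES 64

/* Hypothetical helper: extract the 16-bit field starting at bit `lo`. */
static uint16_t field16(uint64_t word, unsigned lo)
{
    return (uint16_t)((word >> lo) & 0xFFFFu);
}

/* Hypothetical helper: pack four 16-bit fields into one quadword, with
 * f0 in bits [15:0] and f3 in bits [63:48]. WritePointData unpacks a
 * point (x, y, z, a) in this same order. */
static uint64_t pack4(uint16_t f0, uint16_t f1, uint16_t f2, uint16_t f3)
{
    return (uint64_t)f0 | ((uint64_t)f1 << 16) |
           ((uint64_t)f2 << 32) | ((uint64_t)f3 << 48);
}

/* Sequential model of the search (an illustration, not the FPGA code).
 * Layout per hypercube i, as read by RunSearch:
 *   hc[2*i]   : x_min [15:0], x_max [31:16], y_min [47:32], y_max [63:48]
 *   hc[2*i+1] : z_min [15:0], z_max [31:16], a_min [47:32], a_max [63:48]
 * Bit i of the return value is 1 when the point lies inside hypercube i;
 * the bounds are inclusive. */
static uint64_t search_model(const uint64_t hc[2 * NBR_HYPERCUBES],
                             uint16_t x, uint16_t y, uint16_t z, uint16_t a)
{
    uint64_t mask = 0;
    for (unsigned i = 0; i < NBR_HYPERCUBES; i++) {
        uint64_t w0 = hc[2 * i];
        uint64_t w1 = hc[2 * i + 1];
        /* Stage 0 (dimension tests) and stage 1 (AND reduction) */
        int inside =
            field16(w0,  0) <= x && field16(w0, 16) >= x &&
            field16(w0, 32) <= y && field16(w0, 48) >= y &&
            field16(w1,  0) <= z && field16(w1, 16) >= z &&
            field16(w1, 32) <= a && field16(w1, 48) >= a;
        /* Stage 2: pack one result bit per hypercube into a quadword */
        if (inside)
            mask |= (uint64_t)1 << i;
    }
    return mask;
}
```

The FPGA evaluates all 64 hypercubes in parallel and accepts a new point every clock cycle, whereas this model loops over them one at a time. A model of this kind could serve as a reference when checking the quadwords streamed back by RunResultStream.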